Abstract

We introduce an alternative to the notion of ‘fast rate’ in Learning 
Theory, which coincides with the optimal error rate when the given 
class happens to be convex and regular in some sense. While it is 
well known that such a rate cannot always be attained by a learning 
procedure (i.e., a procedure that selects a function in the given class), 
we introduce an aggregation procedure that attains that rate under 
rather minimal assumptions - for example, that the $L_q$ and $L_2$ norms
are equivalent on the linear span of the class for some $q > 2$, and the
target random variable is square-integrable.


1 Introduction 

The focus of this article is on the question of Prediction: let $F$ be a class
of functions defined on a probability space $(\Omega, \mu)$ and let $X$ be distributed
according to $\mu$. Given an unknown target random variable $Y$, one would
like to find some $f \in F$ for which, on average, predicting $f(X)$ instead of
$Y$ is the most ‘cost effective’. If the pointwise cost is measured according
to the squared loss, that is, if the price of predicting $f(X)$ instead of $Y$ is
$(f(X) - Y)^2$, the goal is to identify, or at least approximate in some sense,
the behaviour of the function that minimizes in $F$ the risk $\mathbb{E}(f(X) - Y)^2$,
where the expectation is taken with respect to the joint distribution of $X$
and $Y$ on $\Omega \times \mathbb{R}$. With that in mind, set

$$f^* = \operatorname{argmin}_{f \in F} \mathbb{E}(f(X) - Y)^2,$$

and assume, for the sake of simplicity, that the minimizer exists.

It should also be noted that there are other reasonable choices for the
pointwise cost of predicting $f(X)$ instead of $Y$, and although our results are
presented only for the squared loss, they may be extended to other convex
loss functions, following the path of [?].

Unlike standard questions in Approximation Theory, in the prediction
framework one has limited information: a random sample $(X_i, Y_i)_{i=1}^N$,
selected independently according to the joint distribution of $X$ and $Y$. The
hope is that a typical sample may be used to produce a (random) function
in $F$ that has almost the same ‘predictive capabilities’ as the minimizer $f^*$.

Definition 1.1 For every integer $N$ and a base class $F$, a learning procedure
is a function $\Psi : (\Omega \times \mathbb{R})^N \to F$.

Setting $\hat{f} = \Psi((X_i, Y_i)_{i=1}^N)$, and given $0 < \delta < 1$ and a set of potential
targets $\mathcal{Y}$, the learning procedure $\Psi$ performs with an error rate of $\mathcal{E}_p(F, N, \delta)$
if for every reasonable class of functions $F$ and any $Y \in \mathcal{Y}$,

$$\mathbb{E}\left((\hat{f}(X) - Y)^2 \mid (X_i, Y_i)_{i=1}^N\right) \le \mathbb{E}(f^*(X) - Y)^2 + \mathcal{E}_p(F, N, \delta)$$

with probability at least $1 - \delta$ relative to the samples $(X_i, Y_i)_{i=1}^N$ (i.e., with
respect to the $N$-product of the joint distribution of $X$ and $Y$ endowed on
$(\Omega \times \mathbb{R})^N$).

One would like to identify the ‘best’ learning procedure $\Psi$, in the sense that
the error rate $\mathcal{E}_p$ is as small as possible, find which features of $F$ and $\mathcal{Y}$
govern $\mathcal{E}_p$, and study the way in which $\mathcal{E}_p$ scales with the sample size $N$.

Although it is not obvious from Definition 1.1, the effect the set of
admissible targets $\mathcal{Y}$ has on the error rate $\mathcal{E}_p$ is rather small. In standard
scenarios, $\mathcal{Y}$ consists of all random variables that are bounded by 1 or,
alternatively, that have rapidly decaying tails (e.g., subgaussian or
subexponential). However, as will be explained later, this type of condition can
be relaxed considerably, and $\mathcal{Y}$ may be as large as the $L_2$ unit ball on the
underlying probability space, rather than the $L_\infty$ one.

1.1 Fast and slow rates 

One frequently encounters in literature the terms ‘fast rate’ and ‘slow rate’,
used to describe the behaviour of a learning procedure as a function of
the sample size $N$. Unfortunately, the meaning of the two is somewhat
ambiguous, and is often misinterpreted.

1 This is sometimes called a proper learning procedure.

A common misapprehension is that ‘fast rate’ means that $\mathcal{E}_p$ scales as
$1/N$, and that a ‘slow rate’ implies that $\mathcal{E}_p$ is of the order of $1/\sqrt{N}$; in
reality, the situation is different. Indeed, on one hand, it is straightforward
to construct examples of classes that are simply too rich for a rate of $1/N$
(or even of $1/\sqrt{N}$, for that matter), even in the realizable case, when $Y \in F$;
on the other, the ‘size’ of $F$ does not capture the correct behaviour of $\mathcal{E}_p$: if
$F = \{f_1, f_2\}$ and $Y$ happens to be a $1/\sqrt{N}$ perturbation of the mid-point
$(f_1 + f_2)/2$, no learning procedure can achieve an error rate that is better
than $c/\sqrt{N}$ with probability at least 3/4 using $N$ sample points and for a
suitable absolute constant $c$ (see, e.g., [1] for a more precise statement).

Thus, a reasonable definition of the terms ‘fast rate’ and ‘slow rate’ must
reflect the fact that the error rate is highly affected by the ‘location’ of the
target, as well as by the ‘complexity’ of $F$.

To avoid potential ambiguity, we will refrain from using the terms ‘fast
rate’ and ‘slow rate’ in what follows. Instead, we will adopt the notion
of ‘optimistic rate’ which is, roughly put, the rate one encounters when
the location of the target is favourable, and should be considered as a more
accurate version of the intuitive ‘fast rate’ (see Section 1.2 for the definition).
For example, if $F = \{f_1, f_2\}$, the optimistic rate is of the order of $1/N$ rather
than $1/\sqrt{N}$, seemingly ignoring the possibility that $Y$ is a perturbation of
the mid-point $(f_1 + f_2)/2$ as above.

Note that this example shows that the optimistic rate may be, at times,
unreachable by any learning procedure. Hence, if there is any hope of
constructing a procedure that always attains the optimistic rate regardless of
the location of the target, that procedure must be allowed the flexibility of
selecting functions that are outside the given base class $F$. Such procedures
belong to the model selection aggregation framework.

Definition 1.2 For an integer $N$, an aggregation procedure is a map $\Psi :
(\Omega \times \mathbb{R})^N \to L_2(\mu)$. The procedure has an error rate of $\mathcal{E}_p^{agg}(F, N, \delta)$ if for
every reasonable class of functions $F$ and every target $Y \in \mathcal{Y}$,

$$\mathbb{E}\left((\hat{f}(X) - Y)^2 \mid (X_i, Y_i)_{i=1}^N\right) \le \mathbb{E}(f^*(X) - Y)^2 + \mathcal{E}_p^{agg}(F, N, \delta)$$

with probability at least $1 - \delta$ relative to the $N$-product of the joint distribution
of $X$ and $Y$, and for $\hat{f} = \Psi((X_i, Y_i)_{i=1}^N)$.

Detailed surveys on the aggregation framework in a broad context may be
found in [?].

Thus, rather than restricting one to a learning procedure, i.e., forcing
one to select functions from $F$, the goal here is to construct an aggregation
procedure that attains the optimistic rate under minimal assumptions on $F$
and $\mathcal{Y}$.

The analysis of the aggregation procedure we will introduce below
requires the use of some auxiliary classes that are connected to the given base
class $F$; those will be denoted by $U$, $V$ and $H$. To avoid confusion, in what
follows we will denote ‘generic’ function classes by $\mathcal{F}$ and $\mathcal{K}$.

1.2 The optimistic rate 

The definition of the optimistic rate is based on the method developed
in [18, 19] for the analysis of the Empirical Risk Minimization procedure
(ERM). We will outline the essentials of this method in what follows, but
refer the reader to [18, 19] for a more detailed description of the parameters
involved, their role in the analysis of ERM and the way in which they may
be computed in specific applications.

Definition 1.3 Given a sample $(X_i, Y_i)_{i=1}^N$ and a base class $F$, the empirical
minimizer in $F$ is

$$\hat{f} \in \operatorname{argmin}_{f \in F} \frac{1}{N} \sum_{i=1}^N (f(X_i) - Y_i)^2,$$

assuming, of course, that a minimizer exists.
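
For a finite dictionary, the empirical minimizer of Definition 1.3 can be computed by direct enumeration. The following is only a minimal sketch under that assumption; the names `empirical_risk` and `erm`, the toy dictionary and the toy sample are illustrative and do not appear in the text.

```python
from typing import Callable, Sequence

Function = Callable[[float], float]


def empirical_risk(f: Function, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Empirical squared risk (1/N) * sum_i (f(X_i) - Y_i)^2."""
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def erm(dictionary: Sequence[Function], xs: Sequence[float], ys: Sequence[float]) -> int:
    """Index of an empirical minimizer in a finite dictionary (ties: first index)."""
    risks = [empirical_risk(f, xs, ys) for f in dictionary]
    return min(range(len(dictionary)), key=risks.__getitem__)


# Toy data (hypothetical): Y is roughly X, so the second function should win.
dictionary = [lambda x: 0.0, lambda x: x, lambda x: 2.0 * x]
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.2, 2.8]
best = erm(dictionary, xs, ys)
```

A minimizer always exists here because the class is finite; for infinite classes the existence assumption in Definition 1.3 is genuinely needed.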

From here on we will denote by $P_N h$ the empirical mean $\frac{1}{N}\sum_{i=1}^N h(X_i, Y_i)$.

Recall that $f^* = \operatorname{argmin}_{f \in F} \mathbb{E}(f(X) - Y)^2$, consider the squared excess
loss functional relative to $F$ and $Y$,

$$\mathcal{L}_f(X, Y) = (f(X) - Y)^2 - (f^*(X) - Y)^2,$$

and observe that the minimizer in $F$ of $P_N(f - Y)^2$ is also a minimizer in
$F$ of $P_N \mathcal{L}_f$. Thus, $P_N \mathcal{L}_{\hat{f}} \le 0$, simply because $\mathcal{L}_{f^*} = 0$, and, in particular,

$$\hat{f} \in \{f \in F : P_N \mathcal{L}_f \le 0\}.$$

It follows that if $(X_i, Y_i)_{i=1}^N$ is a sample for which

$$\{f \in F : \mathbb{E}\mathcal{L}_f > \eta\} \subset \{f \in F : P_N \mathcal{L}_f > 0\}, \qquad (1.1)$$

then

$$\mathbb{E}\left((\hat{f}(X) - Y)^2 \mid (X_i, Y_i)_{i=1}^N\right) \le \mathbb{E}(f^*(X) - Y)^2 + \eta,$$

which is the type of result one is looking for.

To obtain (1.1), note that for every $f \in F$ and every sample $(X_i, Y_i)_{i=1}^N$,

$$P_N \mathcal{L}_f \ge \frac{1}{N}\sum_{i=1}^N (f - f^*)^2(X_i) + 2\mathbb{E}(f^*(X) - Y)(f - f^*)(X) - 2\left| \frac{1}{N}\sum_{i=1}^N (f^*(X_i) - Y_i)(f - f^*)(X_i) - \mathbb{E}(f^*(X) - Y)(f - f^*)(X) \right|, \qquad (1.2)$$

i.e., $P_N \mathcal{L}_f$ is lower bounded by a sum of (random) quadratic and multiplier
components, and a deterministic term, $2\mathbb{E}(f^*(X) - Y)(f - f^*)(X)$, which
calibrates the ‘location’ of the target $Y$ relative to $F$.
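
The algebra behind this lower bound is a routine expansion and may be spelled out as follows. The abbreviation $h = f - f^*$ is introduced here only for brevity and is not part of the original notation.

```latex
\begin{align*}
\mathcal{L}_f(X,Y) &= (f(X)-Y)^2 - (f^*(X)-Y)^2 \\
  &= h^2(X) + 2\,(f^*(X)-Y)\,h(X), \qquad h = f - f^*, \\
P_N \mathcal{L}_f
  &= P_N h^2 + 2\,\mathbb{E}(f^*(X)-Y)h(X)
     + 2\,(P_N - \mathbb{E})\bigl[(f^*-Y)\,h\bigr] \\
  &\ge P_N h^2 + 2\,\mathbb{E}(f^*(X)-Y)h(X)
     - 2\,\bigl|(P_N - \mathbb{E})\bigl[(f^*-Y)\,h\bigr]\bigr| .
\end{align*}
```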

The optimistic rate is defined based on the belief that the location of $Y$
is favourable in the sense that for every $f \in F$,

$$\mathbb{E}(f^*(X) - Y)(f - f^*)(X) \ge 0. \qquad (1.3)$$

It is straightforward to verify that (1.3) is satisfied in two important cases.
Firstly, when $F \subset L_2$ happens to be closed and convex, in which case (1.3)
follows from the characterization of the metric projection onto a closed,
convex set in an inner-product space. Secondly, for an arbitrary class $F$ and
a target $Y = f^*(X) + \xi$, where $f^* \in F$ and $\xi$ is mean-zero and independent
of $X$.


Definition 1.4 A class $F$ satisfies a small-ball condition with constants $\kappa_0$
and $\varepsilon$, if for every $f_1, f_2 \in F \cup \{0\}$,

$$Pr\left(|f_1 - f_2| \ge \kappa_0 \|f_1 - f_2\|_{L_2}\right) \ge \varepsilon. \qquad (1.4)$$

The small-ball condition is a rather minimal assumption on $F$ - it is a
uniform lower estimate on the probability that $|f_1 - f_2| / \|f_1 - f_2\|_{L_2}$ is sufficiently
far from zero for every pair of distinct functions $f_1, f_2 \in F \cup \{0\}$.

One may find in [?] several examples of classes that satisfy a small-ball
condition. For our purposes, the most significant example is when $q > 2$
and the $L_q$ and $L_2$ norms are $L$-equivalent on $F$, in the sense that for every
$f_1, f_2 \in F \cup \{0\}$, $\|f_1 - f_2\|_{L_q} \le L\|f_1 - f_2\|_{L_2}$. In such a case, the
Paley-Zygmund inequality [6] shows that (1.4) holds for constants $\kappa_0$ and $\varepsilon$ that
depend only on $q$ and $L$.
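
As an illustration, the special case $q = 4$ can be checked by hand. The sketch below assumes the standard form of the Paley-Zygmund inequality, $Pr(Z > \theta\,\mathbb{E}Z) \ge (1-\theta)^2 (\mathbb{E}Z)^2 / \mathbb{E}Z^2$ for $Z \ge 0$ and $0 \le \theta \le 1$. The choice $\theta = 1/4$ is arbitrary and is made only to produce explicit constants; it is not taken from the text.

```latex
\begin{align*}
&\text{Let } h = f_1 - f_2 \neq 0, \quad Z = h^2(X), \quad
  \|h\|_{L_4} \le L\,\|h\|_{L_2}. \\
&Pr\bigl(|h| > \sqrt{\theta}\,\|h\|_{L_2}\bigr)
  = Pr\bigl(Z > \theta\,\mathbb{E}Z\bigr)
  \ge (1-\theta)^2\,\frac{(\mathbb{E}Z)^2}{\mathbb{E}Z^2}
  = (1-\theta)^2\,\frac{\|h\|_{L_2}^4}{\|h\|_{L_4}^4}
  \ge \frac{(1-\theta)^2}{L^4}. \\
&\theta = 1/4: \qquad \kappa_0 = \tfrac{1}{2}, \qquad
  \varepsilon = \frac{9}{16\,L^4}.
\end{align*}
```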
Let $D$ be the unit ball in $L_2(\mu)$. Given a class of functions $\mathcal{F} \subset L_2(\mu)$,
set $\{G_f : f \in \mathcal{F}\}$ to be the canonical gaussian process indexed by $\mathcal{F}$ and
put

$$\mathbb{E}\|G\|_{\mathcal{F}} = \sup\left\{\mathbb{E}\sup_{f \in \mathcal{F}'} G_f : \mathcal{F}' \subset \mathcal{F}, \ \mathcal{F}' \text{ is finite}\right\}.$$

Definition 1.5 For $F \subset L_2$, let $\mathrm{star}(F) = \{\lambda f : 0 \le \lambda \le 1, \ f \in F\}$ be the
star-shaped hull of $F$ around 0, and set $F - F = \{f - h : f, h \in F\}$. Let

$$U = \left\{\frac{f + h}{2} : f, h \in F\right\}$$

and set $H = \mathrm{star}(U - U)$.

Finally, for $\zeta > 0$, let

$$r_{Q,1}(F, \zeta) = \inf\left\{r > 0 : \mathbb{E}\|G\|_{(H-H) \cap rD} \le \zeta r\sqrt{N}\right\},$$

and

$$r_{Q,2}(F, \zeta) = \inf\left\{r > 0 : \mathbb{E}\sup_{w \in (H-H) \cap rD} \left|\frac{1}{\sqrt{N}}\sum_{i=1}^N \varepsilon_i w(X_i)\right| \le \zeta r\sqrt{N}\right\},$$

where $(\varepsilon_i)_{i=1}^N$ are independent, symmetric $\{-1, 1\}$-valued random variables
that are independent of $(X_i)_{i=1}^N$, and the expectation is taken with respect to
both $(\varepsilon_i)_{i=1}^N$ and $(X_i)_{i=1}^N$.

Note that $U$ is only slightly richer than $F$: it contains $F$ and all the
midpoints of intervals whose ends belong to $F$. If $F$ happens to be convex,
then $U = F$, but in general, $U$ is much smaller than the convex hull of $F$.
Also, $H = \mathrm{star}(U - U)$ is star-shaped around 0, centrally symmetric, and
contains $F - F$; hence, both $F$ and $F - F$ belong to $H - H$.

The parameters $r_{Q,1}$ and $r_{Q,2}$ measure the ‘local’ complexity of the
indexing class: from a statistical point of view, the two capture the correlation
of the indexing class with various forms of random noise. The reader may
find a more detailed explanation of their role in [18] and [19].

It should be noted that the definitions of $r_{Q,1}$ and $r_{Q,2}$ in [19] appear
to be slightly different from the ones defined above. However, the reason
for the difference is that in [19] one considers a convex base class, while
here $F$ need not be convex. If $F$ happens to be convex then $U = F$,
$U - U = F - F$ is convex and centrally symmetric, and $H = F - F$;
therefore $H - H = 2H = 2(F - F)$ and the definitions above coincide with
the ones from [19] up to a factor of 2, which is only an issue of normalization.

The third and final complexity parameter is also a minor modification
of a similar parameter from [?]. It will be used to study the multiplier
component in the decomposition (1.2) of the excess squared-loss functional.


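Definition 1.6 Let $F \subset L_2$ be the given base class and set $U$ and $H$ as
above. For every $u_0 \in U$ consider the random function

$$\phi_{F,N,u_0}(r) = \frac{1}{\sqrt{N}} \sup_{w \in \mathrm{star}(U - u_0) \cap rD} \left| \sum_{i=1}^N \varepsilon_i (u_0(X_i) - Y_i) w(X_i) \right|$$

and set

$$r_M(F, \zeta, \delta, u_0) = \inf\left\{r > 0 : Pr\left(\phi_{F,N,u_0}(r) \le \zeta r^2 \sqrt{N}\right) \ge 1 - \delta\right\}.$$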

The importance of $r_M$ and the way it may be used to upper bound the
multiplier component can be seen in the next lemma from [?].

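Lemma 1.7 Let $u_0 \in U$, $0 < \delta < 1$, $\kappa > 0$ and set $r = 2r_M(F, \kappa/4, \delta/2, u_0)$.
Put $\xi = u_0(X) - Y$, and given a sample $(X_i, Y_i)_{i=1}^N$ set $\xi_i = u_0(X_i) - Y_i$.
Then, with probability at least $1 - \delta$, for every $u \in U$ that satisfies
$\|u - u_0\|_{L_2} \ge r$, one has

$$\left| \frac{1}{N}\sum_{i=1}^N \xi_i (u - u_0)(X_i) - \mathbb{E}\xi(u - u_0)(X) \right| \le \kappa \max\left\{\|u - u_0\|_{L_2}^2, r^2\right\}.$$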

With all the complexity terms in place, one may derive an error estimate
for ERM, performed in any subset of $U$ that satisfies the ‘optimistic’
assumption. Indeed, the following is a minor modification of Lemma 5.2 from [?],
originally formulated for a convex base class, though it is straightforward to
verify that the convexity condition may be relaxed.


In the setup we are interested in, $F \subset L_2$ is the given base class, $U$ and
$H$ are defined as above and $H$ satisfies a small-ball condition with constants
$\kappa_0$ and $\varepsilon$. Fix $Y \in L_2$ and $V \subset U$, put $v^* = \operatorname{argmin}_{v \in V} \|v - Y\|_{L_2}$ and let $\hat{v}$
be the empirical minimizer in $V$ of the squared loss functional.


Theorem 1.8 There exists an absolute constant $c_0$ and for every $\kappa_0 > 0$
and $0 < \varepsilon < 1$ there exist constants $c_1$, $c_2$ and $c_3$ that depend only on $\kappa_0$ and $\varepsilon$
for which the following holds. If for every $v \in V$, $\mathbb{E}(v^*(X) - Y)(v - v^*)(X) \ge 0$,
then with probability at least $1 - \delta - 2\exp(-c_0 \varepsilon^2 N)$,

$$\mathbb{E}\left((\hat{v}(X) - Y)^2 \mid (X_i, Y_i)_{i=1}^N\right) \le \mathbb{E}(v^*(X) - Y)^2 + r^2(v^*),$$

where

$$r(v^*) = 2\max\left\{r_M(F, c_1, \delta/2, v^*), \ r_{Q,1}(F, c_2), \ r_{Q,2}(F, c_3)\right\}.$$

Since $v^*$ is not known, one has to use a uniform version of $r(v^*)$ as a
complexity parameter. This uniform version is the optimistic rate:

Definition 1.9 Given a base class $F$, the optimistic rate in $F$ is defined by

$$r_{opt}(F, \delta, c_1, c_2, c_3) = 2 \sup_{u_0 \in U} \max\left\{r_M^2(F, c_1, \delta/2, u_0), \ r_{Q,1}^2(F, c_2), \ r_{Q,2}^2(F, c_3)\right\}. \qquad (1.5)$$

In what follows, $H = \mathrm{star}(U - U)$ will satisfy the small-ball condition with
constants $\kappa_0$ and $\varepsilon$, and $c_1$, $c_2$ and $c_3$ will be chosen as constants that depend
only on $\kappa_0$ and $\varepsilon$. To avoid cumbersome notation, we will not specify in what
follows that $r_{opt}$ depends on $F$, $\delta$ and the constants $c_1$, $c_2$, $c_3$, but their choice
will be made clear.

When $F$ happens to be convex, $U = F$, and upon selecting $V = F$ it
follows that $\mathbb{E}(f^*(X) - Y)(f - f^*)(X) \ge 0$ for every $f \in F$. Thus, Theorem
1.8 extends the main result from [18] on the performance of ERM in a
convex class and relative to the squared loss. Moreover, since the ‘optimistic
assumption’ includes the choice of $Y = f^*(X) + \xi$ for $f^* \in F$ and $\xi$ that is
independent of $X$, the results from [12, 18] indicate that $r_{opt}$ captures the
minimax rate$^2$ in $F$ under mild structural assumptions on that class.

Therefore, the optimistic rate $r_{opt}$ is defined as what is essentially the
best possible rate that any learning procedure may achieve in $F$ when the
target $Y$ is in a ‘good location’ relative to $F$ in the sense of (1.3). And,
when $F$ happens to be convex, every target $Y \in L_2$ is in a ‘good location’.

Having said that, let us emphasize once again that the problem we wish
to address occurs when the location of the target is less favourable, in
which case no learning procedure can achieve the optimistic rate.

2 Roughly put, the minimax rate is the best possible error rate one may achieve by any
learning procedure, i.e., by any $\Psi : (\Omega \times \mathbb{R})^N \to F$.

We will show that there is an aggregation procedure that always achieves
the optimistic rate when the $L_2$ and $L_q$ norms are equivalent on $\mathrm{span}(F)$
for some $q > 2$, and $Y \in L_2$ - thus overcoming the possible problem that
may occur when the target $Y$ is not in a ‘good location’ relative to the given
class. Let us stress that what allows one to attain the optimistic rate is that
the procedure used is an aggregation procedure, and thus may take values
in $L_2(\mu)$, rather than a learning procedure, which is restricted to values in
$F$.

Theorem 1.10 For every $L \ge 1$ and $q > 2$ there are constants $c_0$, $c_1$, $c_2$ and
$c_3$ that depend only on $L$ and $q$ for which the following holds. Let $F \subset L_2$
be the given base class and let $U$ and $H$ be as above. Assume that for every
$w \in H - H$, $\|w\|_{L_q} \le L\|w\|_{L_2}$. Then, there is an aggregation procedure
$\Psi : (\Omega \times \mathbb{R})^N \to L_2(\mu)$ for which, for every $Y \in L_2$, with probability at least
$1 - \delta - 2\exp(-c_0 N)$,

$$\mathbb{E}\left((\hat{f}(X) - Y)^2 \mid (X_i, Y_i)_{i=1}^N\right) \le \mathbb{E}(f^*(X) - Y)^2 + r_{opt},$$

where $\hat{f} = \Psi((X_i, Y_i)_{i=1}^N)$ and

$$r_{opt} = 2 \sup_{u_0 \in U} \max\left\{r_M^2(F, c_1, \delta/4, u_0), \ r_{Q,1}^2(F, c_2), \ r_{Q,2}^2(F, c_3)\right\}. \qquad (1.6)$$

To put Theorem 1.10 in some perspective, note that in the standard
framework of aggregation, $F$ is a finite dictionary and both the dictionary
and the target are bounded in $L_\infty$ (see, for example, [25, ?] and references
therein). Within that framework one has the following:

Theorem 1.11 [17] There exists an aggregation procedure $\Psi$ for which the
following holds. Assume that $F$ is a finite dictionary consisting of functions
that are bounded by 1, and assume that the target $Y$ is bounded by 1 as well.
Then, for every $x > 0$, with probability at least $1 - 2\exp(-x)$,

$$\mathbb{E}\left((\hat{f}(X) - Y)^2 \mid (X_i, Y_i)_{i=1}^N\right) \le \mathbb{E}(f^*(X) - Y)^2 + c(1 + x)\frac{\log |F|}{N},$$

where $\hat{f} = \Psi((X_i, Y_i)_{i=1}^N)$.

In comparison, when applied to a finite dictionary, Theorem 1.10 leads
to the following:

Corollary 1.12 Let $F$ be a finite dictionary and set $H$ as above. Assume
that $H - H$ is $L$-subgaussian, in the sense that for every $w_1, w_2 \in H - H$,
and every $p \ge 2$, $\|w_1 - w_2\|_{L_p} \le L\sqrt{p}\,\|w_1 - w_2\|_{L_2}$. If $Y \in L_q$ for some $q > 2$
then with probability at least $1 - \delta$,

$$\mathbb{E}\left((\hat{f}(X) - Y)^2 \mid (X_i, Y_i)_{i=1}^N\right) \le \mathbb{E}(f^*(X) - Y)^2 + c_1 \|f^* - Y\|_{L_q}^2 \frac{\log |F|}{N}, \qquad (1.7)$$

where $c_1$ depends only on $q$, $L$ and $\delta$.

The proof of Corollary 1.12 will be presented in Section 4.1. It is well
known that the best error rate a learning procedure may attain for a finite
dictionary is of the order of $\sqrt{(\log |F|)/N}$ (see, e.g. [25, ?]), which is not
remotely close to the optimistic rate $r_{opt}$, as the latter is of the order of
$(\log |F|)/N$. Moreover, the error rate in (1.7) scales well with $\|f^* - Y\|_{L_q}$: it
tends to zero when $Y$ approaches $F$ and the problem becomes ‘more realizable’,
in which case one expects a zero-error when $N \ge c \log |F|$.

The aggregation procedure we will introduce here is a ‘close family
member’ of the one from [17], but with many significant and unavoidable changes.

It should be noted that Audibert obtained in [2] the same estimate as in
Theorem 1.11 but using a different aggregation procedure - the empirical
star algorithm, and it is not clear whether it is possible to obtain a version
of Theorem 1.10 using an analog of the empirical star algorithm. Moreover,
the empirical star algorithm involves running ERM on the star-hull of $F$
and the empirical minimizer; therefore, one has to apply ERM to an infinite
class even if the dictionary is finite. In contrast, the procedure suggested
here uses ERM on a well-chosen $V \subset U$; hence, if $F$ is finite, so is $V$.

Unlike the bounded case, aggregation in unbounded situations was not
fully understood. The benchmark result in that direction is due to Audibert
[3] and independently to Juditsky, Rigollet and Tsybakov [8], who obtained
the following estimate on the expected risk when the class is bounded but
the target may be unbounded:

Theorem 1.13 There is an aggregation procedure $\Psi : (\Omega \times \mathbb{R})^N \to L_2(\mu)$
for which the following holds. Assume that $F$ is a finite dictionary consisting
of functions bounded by 1 and assume that $Y \in L_q$ for $q > 2$. Then, setting
$\hat{f} = \Psi((X_i, Y_i)_{i=1}^N)$,

$$\mathbb{E}\left(\mathbb{E}\left((\hat{f}(X) - Y)^2 \mid (X_i, Y_i)_{i=1}^N\right)\right) \le \mathbb{E}(f^*(X) - Y)^2 + C(q, \|Y\|_{L_q}) \left(\frac{\log |F|}{N}\right)^{q/(q+2)};$$

moreover, this estimate is optimal - up to the constant $C$.
It is interesting to see the subtle differences between the assumptions
used in Theorem 1.13 and the ones from Corollary 1.12. In the former,
the class is assumed to be bounded in $L_\infty$, while in the latter, the class
is $L$-subgaussian, which is a different type of condition: it implies norm
equivalence rather than having a bounded diameter with respect to some
(possibly strong) norm.

As noted in [?], statistical procedures may behave in a very different way
when one assumes even a weak norm equivalence rather than an $L_\infty$ bound,
and the same phenomenon is true here as well: although the dictionary
may consist of unbounded functions, the norm equivalence gives sufficient
information to ensure an error rate of $N^{-1} \log |F|$ rather than the much slower
$(N^{-1} \log |F|)^{q/(q+2)}$. Moreover, the error rates in Theorem 1.11 and
Theorem 1.13 do not scale well with the distance between $Y$ and $F$ and do not
improve even when the problem is arbitrarily close to being realizable.

We end this introduction with some notation. Throughout, absolute
constants are denoted by $c, c_1, \ldots$, etc. Their value may change from line to
line. When a constant depends on a parameter $a$ it will be denoted by $c(a)$.
$A \lesssim B$ means that $A \le cB$ for an absolute constant $c$, and $A \lesssim_a B$ implies
that the constant depends on the parameter $a$. The analogous two-sided
inequalities are denoted by $A \sim B$ and $A \sim_a B$.

For a set $A$, let $\mathbb{1}_A$ be its indicator function and put $|A|$ to be its
cardinality.

Finally, let us mention that we will abuse notation and write $\|z\|_{L_2}$ for the
$L_2$ norm of the function $z$, without specifying the exact probability space on
which the integration is performed. For example, $\|f - Y\|_{L_2}^2 = \mathbb{E}(f(X) - Y)^2$,
while $\|f - f^*\|_{L_2}^2 = \mathbb{E}(f - f^*)^2(X)$. We will denote the unit ball in $L_2$ by
D and the unit sphere by again, without specifying the underlying 

space. 

2 The aggregation procedure 

The aggregation procedure presented here follows the general path of [17],
though with many essential modifications. The core difference between the
method of proof used in [17] and the one we employ here is unavoidable,
as the former is based on a two-sided concentration estimate on empirical
means which is simply false for heavy-tailed functions. Most notably,
two-sided empirical estimates on $L_2$ distances play a central role in [17], and
one has to find an alternative to these concentration-based bounds. To
that end, we will introduce an empirical ‘isomorphic’ upper estimate on $L_2$
distances, which is based on the idea of median-of-means, and which will be
complemented by an ‘almost isometric’ lower bound. Both bounds hold for
any two class-members that are not ‘very close’, and under a weak moment
assumption: that for some $q > 2$ the $L_q$ and $L_2$ norms are $L$-equivalent on
the class. These results are of independent interest and are likely to have
many other applications.

The accurate formulation and proof of the ‘isomorphic’ estimate may be
found in Section 3.2, while the ‘almost isometric’ lower bound is presented
in Section 3.1.

The aggregation procedure consists of two stages. Given a base class
$F$, one must first identify a subset $V \subset F$, which is selected in a
data-dependent way, and which consists of well-behaved functions in a sense that
will be clarified below. Then, in the second stage, one applies ERM to the
set of midpoints of pairs of elements in $V$ (a set which contains $V$ as well),
i.e., to

$$\left\{\frac{v_1 + v_2}{2} : v_1, v_2 \in V\right\},$$

using a second, independent sample.

We begin with the following observation: 

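Lemma 2.1 Let $C \ge 1$, $r > 0$ and $0 < \theta < 1/32$, and consider $V \subset F$ that
satisfies the following:

• $f^* \in V$ (where, as always, $f^* = \operatorname{argmin}_{f \in F} \|f - Y\|_{L_2}$).

• For every $v \in V$, $\|v - Y\|_{L_2}^2 \le \|f^* - Y\|_{L_2}^2 + \max\{Cr^2, \theta\,\mathrm{diam}^2(V, L_2)\}$.

Let $W = \{(v_1 + v_2)/2 : v_1, v_2 \in V\}$, put $w^* = \operatorname{argmin}_{w \in W} \|w - Y\|_{L_2}$ and set
$\hat{w}$ to be the empirical minimizer in $W$.

If $(X_i, Y_i)_{i=1}^N$ is a sample for which, for every $w \in W$,

$$\mathbb{E}(w(X) - Y)^2 - \mathbb{E}(w^*(X) - Y)^2 \le \frac{1}{N}\sum_{i=1}^N \left((w(X_i) - Y_i)^2 - (w^*(X_i) - Y_i)^2\right) + \max\left\{Cr^2, \theta\|w - w^*\|_{L_2}^2\right\}, \qquad (2.1)$$

then

$$\mathbb{E}\left((\hat{w}(X) - Y)^2 \mid (X_i, Y_i)_{i=1}^N\right) \le \mathbb{E}(f^*(X) - Y)^2 + 2Cr^2.$$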
Lemma 2.1 implies that if $V$ consists of functions whose excess risk (relative
to $F$) is either small ($\le Cr^2$), or, alternatively, at least smaller than a fixed
proportion of the square of the diameter of $V$, and if one is given a sample
for which the oracle type inequality (2.1) holds, then ERM performed in $W$
using that sample selects a function whose excess risk is at most $2Cr^2$.

Naturally, at this point Lemma 2.1 is somewhat speculative, as it
contains two substantial ‘if’s’. For the lemma to be of any use, one has to
construct $V$ using a random sample and without knowing the identity of
$f^*$, and then to establish the oracle type inequality in $W$ using a second,
independent sample.

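Proof. Set $d_V = \mathrm{diam}(V, L_2)$, note that $\mathrm{diam}(W, L_2) = d_V$ and that

$$\frac{1}{N}\sum_{i=1}^N (\hat{w}(X_i) - Y_i)^2 - \frac{1}{N}\sum_{i=1}^N (w^*(X_i) - Y_i)^2 \le 0. \qquad (2.2)$$

Consider two cases: firstly, if $d_V < \sqrt{C}r$ then by (2.1) and (2.2),

$$\mathbb{E}\left((\hat{w}(X) - Y)^2 \mid (X_i, Y_i)_{i=1}^N\right) \le \mathbb{E}(w^*(X) - Y)^2 + Cr^2 \le \mathbb{E}(f^*(X) - Y)^2 + Cr^2$$

because $f^* \in V \subset W$.

Secondly, assume that $d_V \ge \sqrt{C}r$. Since $f^* \in V$, there is some $v \in V$
for which $\|v - f^*\|_{L_2} \ge d_V/2$. Set $w = (v + f^*)/2 \in W$ and observe that by
the uniform convexity of the $L_2$ norm and the definition of $V$,

$$\|w^* - Y\|_{L_2}^2 \le \|w - Y\|_{L_2}^2 = \frac{1}{2}\|v - Y\|_{L_2}^2 + \frac{1}{2}\|f^* - Y\|_{L_2}^2 - \frac{1}{4}\|v - f^*\|_{L_2}^2$$
$$\le \|f^* - Y\|_{L_2}^2 + \max\left\{Cr^2, \theta d_V^2\right\} - \frac{d_V^2}{16}$$
$$\le \|f^* - Y\|_{L_2}^2 + Cr^2 - \left(\frac{1}{16} - \theta\right) d_V^2.$$

Combining this with (2.1) applied to $\hat{w}$, and with (2.2), and recalling that
$\theta < 1/32$,

$$\mathbb{E}\left((\hat{w}(X) - Y)^2 \mid (X_i, Y_i)_{i=1}^N\right)$$
$$\le \mathbb{E}(f^*(X) - Y)^2 + Cr^2 + \theta d_V^2 + \left(\mathbb{E}(w^*(X) - Y)^2 - \mathbb{E}(f^*(X) - Y)^2\right)$$
$$\le \mathbb{E}(f^*(X) - Y)^2 + 2Cr^2 - \left(\frac{1}{16} - 2\theta\right) d_V^2$$
$$\le \mathbb{E}(f^*(X) - Y)^2 + 2Cr^2.$$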
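
The ‘uniform convexity’ step used in the second case is the parallelogram identity in $L_2$. The following short derivation spells it out; the auxiliary symbols $a$ and $b$ are introduced here only for illustration.

```latex
\begin{align*}
\Bigl\| \tfrac{a+b}{2} \Bigr\|_{L_2}^2
  &= \tfrac{1}{2}\|a\|_{L_2}^2 + \tfrac{1}{2}\|b\|_{L_2}^2
     - \tfrac{1}{4}\|a-b\|_{L_2}^2,
  \qquad a = v - Y, \quad b = f^* - Y, \\
\|w - Y\|_{L_2}^2
  &= \tfrac{1}{2}\|v-Y\|_{L_2}^2 + \tfrac{1}{2}\|f^*-Y\|_{L_2}^2
     - \tfrac{1}{4}\|v-f^*\|_{L_2}^2 \\
  &\le \tfrac{1}{2}\|v-Y\|_{L_2}^2 + \tfrac{1}{2}\|f^*-Y\|_{L_2}^2
     - \tfrac{d_V^2}{16},
  \qquad \text{since } \|v - f^*\|_{L_2} \ge d_V/2 .
\end{align*}
```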
Next, we shall identify sufficient conditions that allow one to construct
the set $V$ as in Lemma 2.1 in a data-dependent way.

2.1 The construction of V 

Given a class $F$, recall that $U = \{(f_1 + f_2)/2 : f_1, f_2 \in F\}$.

What will assume the role of the empirical mean $\frac{1}{N}\sum_{i=1}^N (f - h)^2(X_i)$ as a
way of estimating $L_2$ distances, is the following median-of-means functional,
which is more stable than the empirical mean when dealing with heavy-tailed
functions:

Definition 2.2 Let $1 \le \ell \le N$, put $M = \lfloor N/\ell \rfloor$ and set
$I_j = \{\ell j + 1, \ldots, \ell(j + 1)\} \subset \{1, \ldots, N\}$ for $0 \le j \le M - 1$. For $v \in \mathbb{R}^N$ let
$\mathrm{Med}_\ell(v)$ be the median of the vector of means $\left(\ell^{-1}\sum_{i \in I_j} v_i\right)_{j=0}^{M-1}$.

Thus, $I_0, \ldots, I_{M-1}$ are disjoint subsets of $\{1, \ldots, N\}$, each of cardinality $\ell$, and
$\mathrm{Med}_\ell(v)$ is the median of the means taken over the ‘blocks’ $I_j$.
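
The median-of-means functional is simple to compute. The sketch below is one possible implementation, not a definitive one: the name `med_ell` is illustrative, and the use of `statistics.median` (which averages the two middle values when the number of blocks is even) is a convention chosen here, since the text does not specify how ties are resolved.

```python
from statistics import median
from typing import Sequence


def med_ell(v: Sequence[float], ell: int) -> float:
    """Median of the means of v over consecutive disjoint blocks of length ell."""
    if not 1 <= ell <= len(v):
        raise ValueError("need 1 <= ell <= len(v)")
    m = len(v) // ell  # M = floor(N / ell) blocks; leftover entries are ignored
    means = [sum(v[j * ell:(j + 1) * ell]) / ell for j in range(m)]
    return median(means)


# Toy vector (hypothetical): one huge outlier among otherwise equal entries.
v = [1.0] * 8 + [1000.0]
plain_mean = sum(v) / len(v)
robust = med_ell(v, 3)
```

In this toy vector the single outlier moves the plain mean to 112 but affects only one of the three block means, so the median of the block means stays at 1.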

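Definition 2.3 Fix $r_U > 0$, $u_0 \in U$, $1 \le \ell \le N$, $0 < \alpha < 1 < \beta$, and set
$\rho = (\alpha/20\beta)^2 \le 1/400$.

Let $\mathcal{A}_{u_0}$ be the set of $N$-samples $(X_i, Y_i)_{i=1}^N$ for which the following holds:

• for every $u \in U$,

$$\left| \frac{1}{N}\sum_{i=1}^N (u_0(X_i) - Y_i)(u - u_0)(X_i) - \mathbb{E}(u_0(X) - Y)(u - u_0)(X) \right| \le \rho \max\left\{r_U^2, \|u - u_0\|_{L_2}^2\right\};$$

• if $u_1, u_2 \in U$ and $\|u_1 - u_2\|_{L_2} \ge r_U$, then

$$\frac{1}{N}\sum_{i=1}^N (u_1 - u_2)^2(X_i) \ge (1 - \rho)\|u_1 - u_2\|_{L_2}^2;$$

• if $u_1, u_2 \in U$ and $\|u_1 - u_2\|_{L_2} \ge r_U$, then

$$\alpha\|u_1 - u_2\|_{L_2} \le \mathrm{Med}_\ell\left(|u_1 - u_2|(X_i)\right)_{i=1}^N \le \beta\|u_1 - u_2\|_{L_2},$$

and if $\|u_1 - u_2\|_{L_2} < r_U$ then

$$\mathrm{Med}_\ell\left(|u_1 - u_2|(X_i)\right)_{i=1}^N \le \beta r_U.$$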
At this point, the right choice of $r_U$, $\ell$, $\alpha$ and $\beta$ is not clear, nor that
$\mathcal{A}_{u_0}$ is nonempty, for that matter.


At last, we are ready to define the aggregation procedure: 

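Definition 2.4 Let $(X_i, Y_i)_{i=1}^{2N}$ be a $2N$-sample and set $\mathcal{D}_1 = (X_i, Y_i)_{i=1}^N$
and $\mathcal{D}_2 = (X_i, Y_i)_{i=N+1}^{2N}$. Recall that

$$\hat{f} = \operatorname{argmin}_{f \in F} \frac{1}{N}\sum_{i=1}^N (f(X_i) - Y_i)^2,$$

and let

$$V(\mathcal{D}_1) = \left\{ f \in F : \frac{1}{N}\sum_{i=1}^N (f(X_i) - Y_i)^2 \le \frac{1}{N}\sum_{i=1}^N (\hat{f}(X_i) - Y_i)^2 + 3\max\left\{ r_U^2, \ \rho\alpha^{-2}\,\mathrm{Med}_\ell^2\left(|f - \hat{f}|(X_i)\right)_{i=1}^N \right\} \right\}. \qquad (2.3)$$

Set

$$W(\mathcal{D}_1) = \left\{ \frac{v_1 + v_2}{2} : v_1, v_2 \in V(\mathcal{D}_1) \right\}$$

and define the aggregation procedure by

$$\hat{w} = \operatorname{argmin}_{w \in W(\mathcal{D}_1)} \frac{1}{N}\sum_{i=N+1}^{2N} (w(X_i) - Y_i)^2. \qquad (2.4)$$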


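Theorem 2.5 Let $w^* = \operatorname{argmin}_{w \in W(\mathcal{D}_1)} \mathbb{E}(w(X) - Y)^2$. If $\mathcal{D}_1 \in \mathcal{A}_{f^*}$ and
$\mathcal{D}_2 \in \mathcal{A}_{w^*}$, then $V(\mathcal{D}_1)$ and $W(\mathcal{D}_1)$ satisfy the conditions of Lemma 2.1 for
$\theta < 1/32$, $C = 6$ and $r = r_U$. In particular, for such a $2N$-sample, and if $\hat{w}$
is the empirical minimizer selected in $W(\mathcal{D}_1)$ using the sample $\mathcal{D}_2$, one has

$$\mathbb{E}\left((\hat{w}(X) - Y)^2 \mid (X_i, Y_i)_{i=1}^{2N}\right) \le \mathbb{E}(f^*(X) - Y)^2 + 12 r_U^2.$$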

Theorem 1.10 follows from Theorem 2.5 once one shows that $r_U^2$ can be
selected to be $r_{opt}$ for the right choice of constants, and that the probability
of the events $\mathcal{A}_{u_0}$ is sufficiently high for every $u_0 \in U$.


The rest of this section is devoted to the proof of Theorem 2.5. The
proof that each $\mathcal{A}_{u_0}$ is a large event will be presented in Section 3.

Remark 2.6 Note that if $F$ is finite, so is $W(\mathcal{D}_1)$ - which may be much
smaller than $F$. Thus, if the original dictionary is finite, then unlike the
empirical star algorithm, the second step in the aggregation procedure is
carried out on a finite set.
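
For a finite dictionary, the two stages of Definition 2.4 can be sketched as follows. This is only an illustration under stated assumptions: functions are plain callables, the parameters `r_u`, `rho`, `alpha` and `ell` are supplied by the caller (the text fixes them only later, as constants depending on $q$ and $L$ and on $r_{opt}$), and the names `aggregate`, `risk` and `med_ell` are hypothetical.

```python
from itertools import combinations_with_replacement
from statistics import median


def risk(f, sample):
    """Empirical squared risk of f on a list of (x, y) pairs."""
    return sum((f(x) - y) ** 2 for x, y in sample) / len(sample)


def med_ell(v, ell):
    """Median of means over consecutive disjoint blocks of length ell."""
    m = len(v) // ell
    return median(sum(v[j * ell:(j + 1) * ell]) / ell for j in range(m))


def aggregate(dictionary, sample, r_u, rho, alpha, ell):
    """Two-stage procedure of Definition 2.4 on a 2N-sample (a sketch)."""
    n = len(sample) // 2
    d1, d2 = sample[:n], sample[n:2 * n]

    # Stage 1: keep the functions whose empirical risk on D_1 is within the
    # data-dependent slack of the empirical minimizer, as in (2.3).
    f_hat = min(dictionary, key=lambda f: risk(f, d1))
    base = risk(f_hat, d1)

    def slack(f):
        dist = med_ell([abs(f(x) - f_hat(x)) for x, _ in d1], ell)
        return 3.0 * max(r_u ** 2, rho * dist ** 2 / alpha ** 2)

    v = [f for f in dictionary if risk(f, d1) <= base + slack(f)]

    # Stage 2: ERM over midpoints of pairs from V, on the second half D_2.
    w = [(lambda x, f=f, g=g: 0.5 * (f(x) + g(x)))
         for f, g in combinations_with_replacement(v, 2)]
    return min(w, key=lambda h: risk(h, d2))


# Toy setting (hypothetical): F = {0, 1} and the target is the mid-point 1/2.
alpha, beta = 0.5, 2.0
rho = (alpha / (20.0 * beta)) ** 2
sample = [(i / 20.0, 0.5) for i in range(20)]
h = aggregate([lambda x: 0.0, lambda x: 1.0], sample,
              r_u=0.1, rho=rho, alpha=alpha, ell=2)
```

In this toy setting both dictionary elements have the same risk, so no selection within $F$ can do better than either of them, whereas the second stage picks the midpoint, which predicts the target exactly.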
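2.2 Proof of Theorem 2.5

Given a set $\mathcal{K} \subset U$, put $h^* = \operatorname{argmin}_{h \in \mathcal{K}} \mathbb{E}(h(X) - Y)^2$ and set

$$\mathcal{L}_h^{\mathcal{K}}(X, Y) = (h(X) - Y)^2 - (h^*(X) - Y)^2$$

to be the square excess loss functional associated with $\mathcal{K}$.

Lemma 2.7 If $(X_i, Y_i)_{i=1}^N \in \mathcal{A}_{h^*}$ then for every $h \in \mathcal{K}$,

$$\mathbb{E}\mathcal{L}_h^{\mathcal{K}} \le P_N \mathcal{L}_h^{\mathcal{K}} + 3\max\left\{r_U^2, \rho\|h - h^*\|_{L_2}^2\right\}. \qquad (2.5)$$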

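Proof. Set $\xi(X, Y) = h^*(X) - Y$, let $(X_i, Y_i)_{i=1}^N \in \mathcal{A}_{h^*}$ and put $\xi_i =
h^*(X_i) - Y_i$. Note that for every $h \in \mathcal{K}$,

$$P_N \mathcal{L}_h^{\mathcal{K}} = \frac{2}{N}\sum_{i=1}^N \xi_i (h - h^*)(X_i) + \frac{1}{N}\sum_{i=1}^N (h - h^*)^2(X_i)$$
$$\ge \mathbb{E}\mathcal{L}_h^{\mathcal{K}} - 2\left| \frac{1}{N}\sum_{i=1}^N \xi_i (h - h^*)(X_i) - \mathbb{E}\xi(h - h^*)(X) \right| + \left( \frac{1}{N}\sum_{i=1}^N (h - h^*)^2(X_i) - \mathbb{E}(h - h^*)^2(X) \right).$$

Recall that if $(X_i, Y_i)_{i=1}^N \in \mathcal{A}_{h^*}$ and $\|h - h^*\|_{L_2} \ge r_U$, one has

$$\frac{1}{N}\sum_{i=1}^N (h - h^*)^2(X_i) \ge (1 - \rho)\|h - h^*\|_{L_2}^2,$$

and

$$\left| \frac{1}{N}\sum_{i=1}^N \xi_i (h - h^*)(X_i) - \mathbb{E}\xi(h - h^*)(X) \right| \le \rho\|h - h^*\|_{L_2}^2;$$

thus

$$P_N \mathcal{L}_h^{\mathcal{K}} \ge \mathbb{E}\mathcal{L}_h^{\mathcal{K}} - 3\rho\|h - h^*\|_{L_2}^2.$$

Otherwise, if $\|h - h^*\|_{L_2} < r_U$,

$$P_N \mathcal{L}_h^{\mathcal{K}} \ge \mathbb{E}\mathcal{L}_h^{\mathcal{K}} - 2\left| \frac{1}{N}\sum_{i=1}^N \xi_i (h - h^*)(X_i) - \mathbb{E}\xi(h - h^*)(X) \right| - \|h - h^*\|_{L_2}^2 \ge \mathbb{E}\mathcal{L}_h^{\mathcal{K}} - 3r_U^2,$$

as claimed.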












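Lemma 2.8 For a sample $\mathcal{D} = (X_i, Y_i)_{i=1}^N$, let $d = \mathrm{diam}(V(\mathcal{D}), L_2)$ and
recall that $f^* = \operatorname{argmin}_{f \in F} \mathbb{E}(f(X) - Y)^2$. If $\mathcal{D} \in \mathcal{A}_{f^*}$ then

1. $f^* \in V(\mathcal{D})$, and

2. for every $v \in V(\mathcal{D})$, $\|v - Y\|_{L_2}^2 \le \|f^* - Y\|_{L_2}^2 + 6\max\{r_U^2, d^2/400\}$.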

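Proof. Fix $\mathcal{D} = (X_i, Y_i)_{i=1}^N \in \mathcal{A}_{f^*}$. Recall that if $f_1, f_2 \in F \subset U$ and
$\|f_1 - f_2\|_{L_2} \ge r_U$, one has

$$\alpha\|f_1 - f_2\|_{L_2} \le \mathrm{Med}_\ell\left(|f_1 - f_2|(X_i)\right)_{i=1}^N \le \beta\|f_1 - f_2\|_{L_2}.$$

In addition, applying Lemma 2.7 for $\mathcal{K} = F$, it follows that for every $f \in F$,

$$0 \le \mathbb{E}\mathcal{L}_f^F \le P_N \mathcal{L}_f^F + 3\max\left\{r_U^2, \rho\|f - f^*\|_{L_2}^2\right\}. \qquad (2.6)$$

Let $\hat{f}$ be the empirical minimizer in $F$ and consider the following two cases:
if $\|\hat{f} - f^*\|_{L_2} \ge r_U$ then $\mathrm{Med}_\ell(|\hat{f} - f^*|(X_i))_{i=1}^N \ge \alpha\|\hat{f} - f^*\|_{L_2}$; alternatively,
$\|\hat{f} - f^*\|_{L_2} < r_U$. Therefore, by (2.6),

$$P_N \mathcal{L}_{\hat{f}}^F = P_N(\hat{f} - Y)^2 - P_N(f^* - Y)^2 \ge -3\max\left\{r_U^2, \rho\|\hat{f} - f^*\|_{L_2}^2\right\}$$
$$\ge -3\max\left\{r_U^2, \ \rho\alpha^{-2}\,\mathrm{Med}_\ell^2\left(|\hat{f} - f^*|(X_i)\right)_{i=1}^N\right\},$$

implying that $f^* \in V(\mathcal{D})$.

Turning to the second part, note that $\hat{f} \in V(\mathcal{D})$, $P_N \mathcal{L}_{\hat{f}}^F \le 0$ and that for
every $v \in V(\mathcal{D}) \subset U$,

$$\mathrm{Med}_\ell\left(|\hat{f} - v|(X_i)\right)_{i=1}^N \le \beta\max\{r_U, \|\hat{f} - v\|_{L_2}\} \le \beta\max\{r_U, d\}.$$

Hence, it follows from the definition of $V(\mathcal{D})$ that for every $v \in V(\mathcal{D})$,

$$P_N \mathcal{L}_v^F \le P_N \mathcal{L}_{\hat{f}}^F + 3\max\left\{r_U^2, \ \rho\alpha^{-2}\,\mathrm{Med}_\ell^2\left(|\hat{f} - v|(X_i)\right)_{i=1}^N\right\}$$
$$\le 3\max\left\{r_U^2, \ \rho\left(\frac{\beta}{\alpha}\right)^2 \max\{r_U^2, d^2\}\right\} = 3\max\left\{r_U^2, \frac{d^2}{400}\right\}. \qquad (2.7)$$

Combining (2.7) with (2.6), for every $v \in V(\mathcal{D})$,

$$\|v - Y\|_{L_2}^2 - \|f^* - Y\|_{L_2}^2 = \mathbb{E}\mathcal{L}_v^F \le 6\max\left\{r_U^2, d^2/400\right\},$$

by the choice of $\rho$. ■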




Proof of Theorem 2.5. One has to show that the assumptions of Lemma
2.1 hold for $V(\mathcal{D}_1)$ and for $W(\mathcal{D}_1)$ for the sample $\mathcal{D}_2$. By Lemma 2.8,
$f^* \in V(\mathcal{D}_1)$ and, for every $v \in V(\mathcal{D}_1)$,

$$\|v - Y\|_{L_2}^2 - \|f^* - Y\|_{L_2}^2 \le \max\left\{6r_U^2, d^2/50\right\}.$$

Also, applying Lemma 2.7 for $W(\mathcal{D}_1) = W \subset U$, it follows that if $\mathcal{D}_2 \in \mathcal{A}_{w^*}$
then for every $w \in W$,

$$\mathbb{E}\mathcal{L}_w^W \le P_N \mathcal{L}_w^W + 3\max\left\{r_U^2, \rho\|w - w^*\|_{L_2}^2\right\}$$
$$\le P_N \mathcal{L}_w^W + \max\left\{3r_U^2, \|w - w^*\|_{L_2}^2/50\right\},$$

where $P_N$ is the empirical mean relative to $\mathcal{D}_2$. Thus, the assumptions of
Lemma 2.1 are verified, completing the proof of Theorem 2.5. ■


3 The events $\mathcal{A}_{u_0}$

The final part of the proof of Theorem 1.10 focuses on the events
$\mathcal{A}_{u_0}$. We will show that for every $u_0 \in U$, $\mathcal{A}_{u_0}$ is a high probability event
provided that $\alpha$, $\beta$ and $\ell$ are properly chosen constants that depend only on
$q$ and $L$, and that $r_U = r_{opt}$ for the right choice of constants.

3.1 An almost isometric lower estimate 

The main result of this section is an ‘almost isometric’ lower bound on
$\inf_{f\in\mathcal{F}} \sum_{i=1}^N f^2(X_i)$ for an arbitrary class $\mathcal{F}$.

The small-ball method, introduced in hsiiizieiiiis, may be used to show 
that if $\mathcal{F}$ satisfies the small-ball condition with constants $\kappa_0$ and $\varepsilon$ then

but $c_0 = c_0(\kappa_0,\varepsilon)$ is a constant that need not be close to 1. To obtain an
almost isometric result rather than an ‘isomorphic’ one, a slightly stronger 
assumption is required. 

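Theorem 3.1 For every $2 < q \le 4$ and $L \ge 1$ there exist constants $c_1$
and $c_2$ that depend only on $q$ and $L$ for which the following holds. Let
$\mathcal{K} = \mathrm{star}(\mathcal{F})$ and assume that for every $h \in \mathcal{K}-\mathcal{K}$, $\|h\|_{L_q} \le L\|h\|_{L_2}$. Set
$\gamma_1 = q/2(q-1)$ and $\gamma_2 = (q-2)/2(q-1)$, and let $0 < \zeta < 1$ and $r$ for which

$$\mathbb{E}\|G\|_{(\mathcal{K}-\mathcal{K})\cap rD} \le \zeta\sqrt{N}r, \qquad \mathbb{E}\sup_{h\in(\mathcal{K}-\mathcal{K})\cap rD}\Big|\frac{1}{\sqrt N}\sum_{i=1}^N \varepsilon_i h(X_i)\Big| \le \zeta\sqrt{N}r. \tag{3.1}$$

Then, with probability at least $1 - 2N\exp(-c_1\zeta^{\gamma_1}N)$, if $h \in \mathcal{K}-\mathcal{K}$ and
$\|h\|_{L_2} \ge r$,

$$\frac{1}{N}\sum_{i=1}^N h^2(X_i) \ge \|h\|_{L_2}^2\,(1 - c_2\zeta^{\gamma_2}). \tag{3.2}$$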


It is highly likely that the exponents $\gamma_1$ and $\gamma_2$ are not optimal. For
example, when $q = 4$ one would expect an estimate of $(1 - c_2\zeta^{1/2})\|h\|_{L_2}^2$
rather than $(1 - c_2\zeta^{1/3})\|h\|_{L_2}^2$ that follows from Theorem 3.1. Fortunately,
this gap has little effect on the proof of Theorem 1.10: once the values of $\alpha$
and $\beta$ are chosen, $\rho = (\alpha/20\beta)^2$ and $\zeta$ satisfies $\rho = c_2\zeta^{\gamma_2}$; thus $\zeta$ is a small
but fixed constant that depends only on $q$ and $L$, and the suboptimal power
in (3.2) will be of little significance in what follows.
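
To make the exponents concrete, the following sketch evaluates $\gamma_1$ and $\gamma_2$ with exact rational arithmetic and inverts the relation $\rho = c_2\zeta^{\gamma_2}$ for $\zeta$. The function names and any numerical values passed for $c_2$ and $\rho$ are illustrative assumptions and are not taken from the text.

```python
from fractions import Fraction


def exponents(q):
    """Return (gamma_1, gamma_2) = (q/(2(q-1)), (q-2)/(2(q-1))) for 2 < q <= 4."""
    q = Fraction(q)
    if not (2 < q <= 4):
        raise ValueError("the text only considers 2 < q <= 4")
    gamma_1 = q / (2 * (q - 1))
    gamma_2 = (q - 2) / (2 * (q - 1))
    return gamma_1, gamma_2


def zeta_from_rho(rho, c2, q):
    """Solve rho = c2 * zeta**gamma_2 for zeta; c2 is a hypothetical constant."""
    _, gamma_2 = exponents(q)
    return (rho / c2) ** (1.0 / float(gamma_2))


gamma_1, gamma_2 = exponents(4)
print(gamma_1, gamma_2)  # 2/3 1/3
```

For $q = 4$ this reproduces the power $1/3$ mentioned above. Note also that $\gamma_1 + \gamma_2 = 1$ for every admissible $q$, which follows directly from the two definitions.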

For the proof of Theorem 3.1 we will first present an almost isometric
lower estimate for a finite set. In the general case, that set will be an
appropriate net which approximates $\mathcal{F}$, and the final step in the proof will
be an upper estimate on the empirical ‘approximation errors’.


3.1.1 An estimate for a single function 

Given integers $N$ and $m$ and a function $f \in L_q$ for some $2 < q \le 4$, set

$$\phi(f) = \begin{cases} f & \text{if } |f| \le (N/m)^{1/q}\|f\|_{L_q}, \\ (N/m)^{1/q}\|f\|_{L_q}\,\mathrm{sgn}(f) & \text{if } |f| > (N/m)^{1/q}\|f\|_{L_q}. \end{cases}$$

Hence, $\phi$ is a truncation of $f$ at a level that is selected according to the $L_q$
space to which $f$ belongs, the sample size $N$ and a parameter $m$ that will
be used to calibrate the probability estimate.

Observe that pointwise, $|\phi(f)| \le |f|$, and given a sample $(X_i)_{i=1}^N$, set

$$I_f = \{i : (\phi(f))(X_i) = f(X_i)\} = \{i : |f(X_i)| \le (N/m)^{1/q}\|f\|_{L_q}\}.$$
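
The truncation and the index set $I_f$ can be illustrated on sampled values. The sketch below is only an illustration: the function name `truncate` is hypothetical, the argument `lq_norm` stands in for the population norm $\|f\|_{L_q}$ (which is assumed to be known here), and $N$ is taken to be the number of sampled values.

```python
def truncate(values, lq_norm, m, q):
    """Truncate sampled values f(X_i) at level (N/m)**(1/q) * ||f||_{L_q}.

    Returns the truncated values phi(f)(X_i) and the index set I_f of
    coordinates that the truncation leaves unchanged. N is len(values).
    """
    N = len(values)
    level = (N / m) ** (1.0 / q) * lq_norm
    # Clipping to [-level, level] keeps the sign of large values.
    phi = [max(-level, min(level, v)) for v in values]
    I_f = [i for i, v in enumerate(values) if abs(v) <= level]
    return phi, I_f


# Hypothetical sample values; with N = 4, m = 1, q = 4 the level is 4**0.25.
phi, I_f = truncate([0.5, -3.0, 10.0, 1.0], lq_norm=1.0, m=1, q=4)
print(I_f)  # [0, 3]
```

Every coordinate outside $I_f$ is mapped to a value of absolute value exactly equal to the truncation level, which is the property used in the proof of Theorem 3.2.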
Theorem 3.2 There exist absolute constants $c_0, c_1, c_2$ and for $2 < q \le 4$
there are constants $c_3, c_4$ that depend only on $q$ for which the following holds.
If $N \ge c_0 m$, then with probability at least $1 - 2\exp(-c_1 m)$,

1. $|I_f| \ge N - c_2 m$, and

2. for every $J \subset \{1,\dots,N\}$ with $|J| \le 4m$,



Proof. Observe that $Pr(|f| > (N/m)^{1/q}\|f\|_{L_q}) \le m/N$. Using a standard
binomial estimate applied to the event $\{|f| > (N/m)^{1/q}\|f\|_{L_q}\}$, it follows that
for $0 < u < N/4m$,

$$Pr(|I_f^c| \ge um) \le \binom{N}{um}\Big(\frac{m}{N}\Big)^{um} \le \Big(\frac{e}{u}\Big)^{um},$$

and the first claim follows.
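
The binomial estimate can be checked numerically. The sketch below compares, on a logarithmic scale, the union bound $\binom{N}{k}(m/N)^k$ with the closed form $(e/u)^k$ for $k = um$; the function name is hypothetical and, for simplicity, $u$ is assumed to be an integer so that $k$ is an integer as well.

```python
import math


def log_binomial_tail_bounds(N, m, u):
    """Return (log of C(N, k) * (m/N)**k, log of (e/u)**k) for k = u * m."""
    k = u * m
    if not (0 < k <= N):
        raise ValueError("need 0 < u*m <= N")
    log_union = math.log(math.comb(N, k)) + k * math.log(m / N)
    log_closed_form = k * (1.0 - math.log(u))
    return log_union, log_closed_form


lhs, rhs = log_binomial_tail_bounds(N=1000, m=10, u=5)
print(lhs <= rhs)  # True
```

The comparison relies only on the elementary inequality $\binom{N}{k} \le (eN/k)^k$, so it holds for every admissible choice of the parameters; the estimate is useful once $u > e$, where the right-hand side decays exponentially in $m$.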

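Next, consider $h = (\phi(f))^2$. Since $\|h\|_{L_\infty} \le (N/m)^{2/q}\|f\|_{L_q}^2$ and $|\phi(f)| \le |f|$,

$$\|h\|_{L_2}^2 = \mathbb{E}|\phi(f)|^q\cdot|\phi(f)|^{4-q} \le \|f\|_{L_q}^q\cdot\Big(\frac{N}{m}\Big)^{-1+4/q}\|f\|_{L_q}^{4-q} = \Big(\frac{N}{m}\Big)^{-1+4/q}\|f\|_{L_q}^4.$$

Hence, by Bernstein’s inequality (see, e.g., [26]) for $h = (\phi(f))^2$,

$$\Big|\frac{1}{N}\sum_{i=1}^N(\phi(f))^2(X_i) - \mathbb{E}(\phi(f))^2\Big| \le \Big(\frac{m}{N}\Big)^{1-2/q}\|f\|_{L_q}^2$$

with probability at least

$$1 - 2\exp\Big(-c_2N\min\Big\{\frac{t^2}{\|h\|_{L_2}^2}, \frac{t}{\|h\|_{L_\infty}}\Big\}\Big) \ge 1 - 2\exp(-c_2m), \qquad t = \Big(\frac{m}{N}\Big)^{1-2/q}\|f\|_{L_q}^2.$$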
Recall that if $i \in I_f^c$ then $|(\phi(f))(X_i)| = (N/m)^{1/q}\|f\|_{L_q}$, and by the first
part of the claim, $|I_f^c| \le c_2m$. Therefore,

$$\frac{1}{N}\sum_{i\in I_f^c}(\phi(f))^2(X_i) \le \frac{|I_f^c|}{N}\Big(\frac{N}{m}\Big)^{2/q}\|f\|_{L_q}^2 \le c_4\Big(\frac{m}{N}\Big)^{1-2/q}\|f\|_{L_q}^2.$$

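Also,

$$\mathbb{E}\phi^2(f) \ge \mathbb{E}f^2\mathbb{1}_{\{|f|\le(N/m)^{1/q}\|f\|_{L_q}\}} = \mathbb{E}f^2 - \mathbb{E}f^2\mathbb{1}_{\{|f|>(N/m)^{1/q}\|f\|_{L_q}\}},$$

and

$$\mathbb{E}f^2\mathbb{1}_{\{|f|>(N/m)^{1/q}\|f\|_{L_q}\}} = \int_0^\infty 2t\,Pr\big(|f|\mathbb{1}_{\{|f|>(N/m)^{1/q}\|f\|_{L_q}\}} > t\big)\,dt$$

$$\le \Big(\frac{N}{m}\Big)^{2/q}\|f\|_{L_q}^2\,Pr\big(|f| > (N/m)^{1/q}\|f\|_{L_q}\big) + \int_{(N/m)^{1/q}\|f\|_{L_q}}^\infty 2t\,Pr(|f| > t)\,dt \le \frac{q}{q-2}\Big(\frac{m}{N}\Big)^{1-2/q}\|f\|_{L_q}^2;$$

thus,

$$\mathbb{E}(\phi(f))^2 \ge \mathbb{E}f^2 - \frac{q}{q-2}\Big(\frac{m}{N}\Big)^{1-2/q}\|f\|_{L_q}^2.$$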

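Combining these observations, with probability at least $1 - 2\exp(-c_2m)$,

$$\frac{1}{N}\sum_{i\in I_f}f^2(X_i) = \frac{1}{N}\sum_{i\in I_f}(\phi(f))^2(X_i) = \frac{1}{N}\sum_{i=1}^N(\phi(f))^2(X_i) - \frac{1}{N}\sum_{i\in I_f^c}(\phi(f))^2(X_i)$$

$$\ge \mathbb{E}(\phi(f))^2 - c_4\Big(\frac{m}{N}\Big)^{1-2/q}\|f\|_{L_q}^2 \ge \mathbb{E}f^2 - c(q)\Big(\frac{m}{N}\Big)^{1-2/q}\|f\|_{L_q}^2,$$

and, in a similar fashion,

$$\frac{1}{N}\sum_{i\in I_f}f^2(X_i) \le \mathbb{E}f^2 + c(q)\Big(\frac{m}{N}\Big)^{1-2/q}\|f\|_{L_q}^2,$$

for a constant $c(q)$ that depends only on $q$.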

Finally, note that if $i \in I_f$ then $|f(X_i)| \le (N/m)^{1/q}\|f\|_{L_q}$. Hence, for
every $J \subset I_f$,

$$\frac{1}{N}\sum_{j\in J}f^2(X_j) \le \frac{|J|}{N}\Big(\frac{N}{m}\Big)^{2/q}\|f\|_{L_q}^2,$$

which completes the proof.
3.1.2 A uniform lower bound

Let $\mathcal{F} \subset L_2$ and $\eta > 0$, and set

$$\Phi_N(\mathcal{F},\eta) = \mathbb{E}\sup_{h\in(\mathcal{F}-\mathcal{F})\cap\eta D}\Big|\frac{1}{N}\sum_{i=1}^N\varepsilon_ih(X_i)\Big| + \eta.$$

When the underlying class $\mathcal{F}$ or the sample size $N$ are clear, we will abuse
notation and write $\Phi(\eta)$ instead of $\Phi_N(\mathcal{F},\eta)$.

Let $N(\varepsilon,\mathcal{F},L_2)$ be the minimal number of open $\varepsilon$-balls with respect to
the $L_2$ norm that are needed to cover $\mathcal{F}$, and set

$$e_m(\mathcal{F}) = \inf\{\varepsilon > 0 : \log N(\varepsilon,\mathcal{F},L_2) \le m\}$$

to be the $m$-th entropy number of $\mathcal{F}$. The centres of the balls are called a
minimal cover of $\mathcal{F}$.


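Lemma 3.3 Let $\mathcal{F} \subset rS(L_2)$, set $2 < q \le 4$ and assume that for every
$f_1, f_2 \in \mathcal{F}\cup\{0\}$, $\|f_1-f_2\|_{L_q} \le L\|f_1-f_2\|_{L_2}$. If $1 \le m \le N$ and $e_m(\mathcal{F}) \le \eta$
then with probability at least $1 - 2N\exp(-c_1m)$,

$$\inf_{f\in\mathcal{F}}\frac{1}{N}\sum_{i=1}^Nf^2(X_i) \ge r^2\Big(1 - c_2\Big(\Big(\frac{m}{N}\Big)^{1-2/q} + \frac{\Phi(\eta)}{r}\sqrt{\frac{N}{m}}\Big)\Big),$$

where $c_1$ is an absolute constant and $c_2$ is a constant that depends only on $L$
and $q$.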


Proof. Fix an integer $m$ and let $\mathcal{F}'$ be a minimal $\eta$-cover of $\mathcal{F}$ with respect
to the $L_2$ norm. Since $\eta \ge e_m(\mathcal{F})$ it follows that $\log|\mathcal{F}'| \le m$.

Let $f \in \mathcal{F}$ and set $\pi f \in \mathcal{F}'$ for which $\|f-\pi f\|_{L_2} \le \eta$. Put $v_j = \pi f(X_j)$
and $u_j = (f-\pi f)(X_j)$, and observe that if $I \subset \{1,\dots,N\}$ then

$$\sum_{i=1}^Nf^2(X_i) \ge \sum_{i\in I}f^2(X_i) = \sum_{i\in I}\big(\pi f(X_i) + (f-\pi f)(X_i)\big)^2.$$

Let $I_{\pi f} = \{i : \pi f(X_i) = (\phi(\pi f))(X_i)\}$ be as in Theorem 3.2, set
$J_f \subset \{1,\dots,N\}$ to be the union of the set of the largest $2m$ coordinates
of $(|f-\pi f|(X_i))_{i=1}^N = (|u_i|)_{i=1}^N$ and the set of the largest $2m$ coordinates of
$(|\pi f(X_i)|)_{i=1}^N = (|v_i|)_{i=1}^N$. Applying Theorem 3.2, the union bound and the
$L_q$-$L_2$ norm equivalence, there is an absolute constant $c_1$ and a constant $c_2$
that depends only on $q$ for which, with probability at least $1 - 2\exp(-c_1m)$,
for every $\pi f \in \mathcal{F}'$,



Next, one has to obtain a high probability estimate on the ‘coordinate
distribution’ of the vector $(|u_i|)_{i=1}^N$. To that end, fix $t > 0$ and observe that
by symmetrization and contraction arguments (see, e.g. mm), 


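$$t\,\mathbb{E}\sup_{f\in\mathcal{F}}|\{i : |f-\pi f|(X_i) > t\}| \le \mathbb{E}\sup_{f\in\mathcal{F}}\sum_{i=1}^N|f-\pi f|(X_i) \le 2\mathbb{E}\sup_{f\in\mathcal{F}}\Big|\sum_{i=1}^N\varepsilon_i(f-\pi f)(X_i)\Big| + N\sup_{f\in\mathcal{F}}\mathbb{E}|f-\pi f| \le 2N\Phi(\eta).$$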


Fix tj to be named later and apply Talagrand’s concentration inequality for 
bounded empirical processes [ 2 H GHJ , 4 j to the class of indicator functions 
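$\{\mathbb{1}_{\{|f-\pi f|>t_j\}} : f \in \mathcal{F}\}$. Thus, with probability at least $1 - 2\exp(-m)$, for
every $f \in \mathcal{F}$,

$$\frac{1}{N}\sum_{i=1}^N\mathbb{1}_{\{|f-\pi f|>t_j\}}(X_i) \le c_3\Big(\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{N}\sum_{i=1}^N\mathbb{1}_{\{|f-\pi f|>t_j\}}(X_i) + \sigma_j\sqrt{\frac{m}{N}} + \frac{m}{N}\Big) = (*)_j,$$

where $\sigma_j = \sup_{f\in\mathcal{F}}Pr^{1/2}(|f-\pi f| > t_j)$. By the $L_q$ and $L_2$ norm equivalence,

$$\sigma_j^2 \le \sup_{f\in\mathcal{F}}\frac{\mathbb{E}|f-\pi f|^q}{t_j^q} \le \Big(\frac{L\eta}{t_j}\Big)^q.$$

Therefore, if $j \ge 2m$ and $t_j = c_4(q,L)\Phi(\eta)N/j$,

$$(*)_j \le c_3\Big(\frac{2\Phi(\eta)}{t_j} + \sqrt{\frac{m}{N}}\Big(\frac{L\eta}{t_j}\Big)^{q/2} + \frac{m}{N}\Big) \le \frac{j}{N},$$

because $\Phi(\eta) \ge \eta$.

Summing for $2m \le j \le N$, with probability at least $1 - 2N\exp(-m)$,
for every $j \ge 2m$ and every $f \in \mathcal{F}$,

$$\Big|\Big\{i : |(f-\pi f)(X_i)| > c_4\Phi(\eta)\frac{N}{j}\Big\}\Big| \le j.$$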











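And, on that event for $j \ge 2m$,

$$u_j^* \le c_4\Phi(\eta)\frac{N}{j},$$

where $(u_j^*)_{j=1}^N$ denotes a non-increasing rearrangement of $(|u_j|)_{j=1}^N$.

Hence,

$$\Big(\sum_{j\ge2m}(u_j^*)^2\Big)^{1/2} \le c_4\Phi(\eta)N\Big(\sum_{j\ge2m}j^{-2}\Big)^{1/2} \le c_5(q,L)\Phi(\eta)N/\sqrt{m},$$

and setting $I = I_{\pi f}\setminus J_f$,

$$\Big(\sum_{i\in I}v_i^2\Big)^{1/2}\Big(\sum_{i\in I}u_i^2\Big)^{1/2} \le c_6(q,L)N^{3/2}r\Phi(\eta)/\sqrt{m}.$$

Recalling the lower estimate on $\sum_{i\in I}v_i^2$, it follows that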


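Proof of Theorem 3.1. Let $0 < \zeta < 1$ and set $r$ for which (3.1) holds.
Recall that $\mathcal{K} = \mathrm{star}(\mathcal{F})$, and thus $\mathcal{K}-\mathcal{K}$ is star-shaped around 0. Hence,
it is standard to verify that if $r' \ge r$ then

$$\mathbb{E}\sup_{w\in(\mathcal{K}-\mathcal{K})\cap r'D}\Big|\frac{1}{N}\sum_{i=1}^N\varepsilon_iw(X_i)\Big| \le \zeta r'$$

and

$$\mathbb{E}\|G\|_{(\mathcal{K}-\mathcal{K})\cap r'D} \le \zeta\sqrt{N}r'.$$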

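Consider the class $\mathcal{F}_r = \mathrm{star}(\mathcal{F})\cap rS(L_2) = \mathcal{K}\cap rS(L_2)$. Set $\eta = c\,\mathbb{E}\|G\|_{\mathcal{F}_r}/\sqrt{m}$
and note that by Sudakov’s minoration (see, e.g. [23, [IS]), $\eta \ge e_m(\mathcal{F}_r)$, provided
that $c$ is a well-chosen absolute constant. Therefore,

$$\Phi_N(\mathcal{F}_r,\eta) \le \mathbb{E}\sup_{w\in(\mathcal{K}-\mathcal{K})\cap2rD}\Big|\frac{1}{N}\sum_{i=1}^N\varepsilon_iw(X_i)\Big| + \eta \le 2\zeta r + \frac{c\,\mathbb{E}\|G\|_{\mathcal{F}_r}}{\sqrt{m}}.$$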


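Set $m = \theta N$ for a constant $0 < \theta < 1$ to be specified later; thus,

$$\frac{\Phi_N(\mathcal{F}_r,\eta)}{r} \le 2\zeta + \frac{c}{\sqrt{\theta}}\cdot\frac{\mathbb{E}\|G\|_{\mathcal{F}_r}}{r\sqrt{N}}.$$

Moreover, $\mathcal{F}_r \subset (\mathcal{K}-\mathcal{K})\cap rD$. By the choice of $r$, $\mathbb{E}\|G\|_{\mathcal{F}_r} \le \mathbb{E}\|G\|_{(\mathcal{K}-\mathcal{K})\cap rD} \le \zeta r\sqrt{N}$, and

$$\frac{\Phi_N(\mathcal{F}_r,\eta)}{r} \le \frac{c_1\zeta}{\sqrt{\theta}}.$$

Thanks to Lemma 3.3, with probability at least $1 - 2N\exp(-c_2\theta N)$, for
every $f \in \mathcal{F}_r$,

$$\frac{1}{N}\sum_{i=1}^Nf^2(X_i) \ge r^2\Big(1 - c(q,L)\Big(\theta^{1-2/q} + \frac{\zeta}{\theta}\Big)\Big). \tag{3.3}$$

Setting $\theta = \zeta^{q/2(q-1)}$, the claim follows for $f \in \mathcal{F}_r = \mathrm{star}(\mathcal{F})\cap rS(L_2)$.

Finally, since (3.3) is positive homogeneous and $\mathrm{star}(\mathcal{F})$ is star-shaped
around 0, it also holds on the same event when $f \in \mathrm{star}(\mathcal{F})$ and $\|f\|_{L_2} \ge r$.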


3.2 The Median of means as a crude measure of distances 

As noted above, the results of PE QUEUE! show that the small-ball method 
suffices to ensure that with probability at least 1 — 2exp(— cN), 

$$\alpha^2\|f-h\|_{L_2}^2 \le \frac{1}{N}\sum_{i=1}^N(f-h)^2(X_i)$$

for well chosen constants $\alpha$ and $c$ that depend only on the small-ball condition
in $\mathcal{F}$, and for every $f,h \in \mathcal{F}$ whose $L_2$-distance is not ‘too small’.
However, if class members do not have well-behaved tails, the probability 
that 

$$\alpha^2\|f-h\|_{L_2}^2 \le \frac{1}{N}\sum_{i=1}^N(f-h)^2(X_i) \le \beta^2\|f-h\|_{L_2}^2$$

even for a single pair $f,h \in \mathcal{F}$ may be rather small; certainly not of the order
of $1 - 2\exp(-cN)$. Unfortunately, this means that the empirical mean is a
poor two-sided estimator of distances, as it lacks stability: if $f-h$ is a heavy-tailed
function, there will be at least one very large value of $|(f-h)(X_i)|$,
and that will destroy any hope of having

$$\frac{1}{N}\sum_{i=1}^N(f-h)^2(X_i) \le \beta^2\|f-h\|_{L_2}^2$$

unless $\beta$ is very large.

To bypass this obstacle, we will use the more stable median-of-means 
functional. 


Let us begin by showing that very little ‘mixing’ is needed for an empirical
mean to satisfy a small-ball estimate with a rather high (constant)
probability.

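Lemma 3.4 For every $q > 2$ and $L \ge 1$ there are constants $\ell$ and $\kappa_0$ that
depend only on $q$ and $L$ for which the following holds. If $\|Z\|_{L_q} \le L\|Z\|_{L_2}$
and $Z_1,\dots,Z_\ell$ are independent copies of $Z$, then

$$Pr\Big(\frac{1}{\ell}\sum_{i=1}^\ell|Z_i| \ge \kappa_0\|Z\|_{L_2}\Big) \ge \frac{3}{4}.$$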



Proof. Since $\|Z\|_{L_q} \le L\|Z\|_{L_2}$, it follows from a standard application of
the Paley-Zygmund inequality (see, e.g., 0) that Z satisfies a small-ball 
condition with constants $c_1$ and $c_2$ that depend only on $q$ and $L$. Therefore,

$$\|Z\|_{L_1} \ge c_1\|Z\|_{L_2}\,Pr(|Z| \ge c_1\|Z\|_{L_2}) \ge c_1c_2\|Z\|_{L_2}. \tag{3.4}$$


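By an appropriate version of the Berry-Esseen inequality for independent
copies of $Z \in L_q$ for $q > 2$ [2T], if $\ell \ge c_3(q,L)$ then

$$\sup_{t\in\mathbb{R}}\Big|Pr\Big(\frac{1}{\ell}\sum_{i=1}^\ell|Z_i| > \mathbb{E}|Z| + \frac{t\,\||Z|-\mathbb{E}|Z|\|_{L_2}}{\sqrt{\ell}}\Big) - Pr(g > t)\Big| \le 0.05.$$

Take $t < 0$ to be the largest for which $Pr(g > t) \ge 0.8$. Applying (3.4), if
$\ell \ge 4t^2/(c_1c_2)^2$ then $\mathbb{E}|Z| \ge 2|t|\|Z\|_{L_2}/\sqrt{\ell}$, and

$$\mathbb{E}|Z| + \frac{t\,\||Z|-\mathbb{E}|Z|\|_{L_2}}{\sqrt{\ell}} \ge (c_1c_2/2)\|Z\|_{L_2}.$$

Therefore, setting $\kappa_0 = c_1c_2/2$ (which depends only on $q$ and $L$),

$$Pr\Big(\frac{1}{\ell}\sum_{i=1}^\ell|Z_i| \ge \kappa_0\|Z\|_{L_2}\Big) \ge \frac{3}{4}.$$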
i= 1 


Fix $2 < q \le 4$ and $L \ge 1$, and set $\ell$ and $\kappa_0$ as in Lemma 3.4. Without
loss of generality, assume that $N = \ell M$ for an integer $M$, and recall that
for $v \in \mathbb{R}^N$, $\mathrm{Med}_\ell(v)$ is the median of the vector of means performed in the
$M$ blocks $I_0,\dots,I_{M-1}$.
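
The median-of-means functional recalled above can be sketched in a few lines. The code is only an illustration: the function name is hypothetical, and the blocks $I_0,\dots,I_{M-1}$ are assumed to be consecutive index ranges of length $\ell$, which may differ from the partition fixed earlier in the text.

```python
import statistics


def median_of_means(v, ell):
    """Median of the M = len(v) // ell block means of v (blocks of length ell)."""
    N = len(v)
    if ell <= 0 or N % ell != 0:
        raise ValueError("N must be a positive multiple of ell")
    M = N // ell
    # Block I_j is assumed to consist of the indices j*ell, ..., (j+1)*ell - 1.
    block_means = [sum(v[j * ell:(j + 1) * ell]) / ell for j in range(M)]
    return statistics.median(block_means)


# One huge coordinate ruins the empirical mean but not the median of means.
sample = [1.0, 1.0, 1.0, 1.0, 1.0, 1000.0]
print(sum(sample) / len(sample))       # 167.5
print(median_of_means(sample, ell=2))  # 1.0
```

A single outlier can spoil at most one block mean, so the median over blocks is unaffected as long as fewer than half of the blocks are contaminated. This is the stability property that the empirical mean lacks for heavy-tailed functions.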








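Theorem 3.5 For every $2 < q \le 4$ and $L \ge 1$ there exist constants
$c_1, c_2, c_3$ and $\alpha < 1 < \beta$ that depend only on $q$ and $L$, for which the following
holds. Let $\mathcal{F} \subset L_2$, put $\mathcal{K} = \mathrm{star}(\mathcal{F})$ and assume that for every
$w \in \mathcal{K}$, $\|w\|_{L_q} \le L\|w\|_{L_2}$. Set $r > 0$ that satisfies

$$\mathbb{E}\|G\|_{(\mathcal{K}-\mathcal{K})\cap rD} \le c_1\sqrt{N}r, \qquad \mathbb{E}\sup_{h\in(\mathcal{K}-\mathcal{K})\cap rD}\Big|\frac{1}{\sqrt N}\sum_{i=1}^N\varepsilon_ih(X_i)\Big| \le c_2\sqrt{N}r.$$

Then, with probability at least $1 - 2\exp(-c_3N)$, for every $w \in \mathcal{K}$ for which
$\|w\|_{L_2} \ge r$,

$$\alpha\|w\|_{L_2} \le \mathrm{Med}_\ell(|w(X_i)|)_{i=1}^N \le \beta\|w\|_{L_2}.$$

Moreover, on the same event, if $\|w\|_{L_2} \le r$ then

$$\mathrm{Med}_\ell(|w(X_i)|)_{i=1}^N \le \beta r.$$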


The proof of Theorem 3.5 follows the same lines as the proof of Theorem
4.3 from m■ It is based on the following observation. 


Lemma 3.6 There are absolute constants $c_1$ and $c_2$ for which the following
holds. Consider $Z \in L_2$ that satisfies a small-ball condition with constants
$\kappa_0$ and $\varepsilon$. If $Z_1,\dots,Z_N$ are independent copies of $Z$, then with probability at
least $1 - 2\exp(-c_1\delta^2\varepsilon N)$ there is a subset $I \subset \{1,\dots,N\}$, $|I| \ge (1-\delta)\varepsilon N$,
and for every $i \in I$,

$$\kappa_0\|Z\|_{L_2} \le |Z_i| \le c_2\|Z\|_{L_2}/\sqrt{\delta\varepsilon}.$$

Proof. Fix $0 < \delta < 1$ and let $A = \{\kappa_0\|Z\|_{L_2} \le |Z| \le 3\|Z\|_{L_2}/\sqrt{\delta\varepsilon}\}$.
Combining the small-ball condition and Chebyshev’s inequality, $Pr(A) \ge 1 - (1+\delta/3)\varepsilon$.
Let $\eta$ be a selector (i.e., a $\{0,1\}$-valued random variable)
with mean $(1+\delta/3)\varepsilon$ and set $\eta_1,\dots,\eta_N$ to be independent copies of $\eta$. A
standard concentration argument shows that with probability at least
$1 - 2\exp(-c_1\delta^2\varepsilon N)$,

$$|\{i : \eta_i = 1\}| = \sum_{i=1}^N\eta_i \le (1+\delta/3)^2\varepsilon N \le (1+\delta)\varepsilon N,$$

and the claim follows. ■

Proof of Theorem 3.5. Let $\ell$ and $\kappa_0$ be as in Lemma 3.4 and recall
that the two constants depend only on $q$ and $L$. Assume, without loss of
generality, that $M = N/\ell$ is an integer, set $\varepsilon = 3/4$ and fix $0 < \delta < 1$ for
which $(1-\delta)\varepsilon = 0.6$. Set $\mathcal{K} = \mathrm{star}(\mathcal{F})$, let $c_1$ and $c_2$ to be named later and
put $r > 0$ that satisfies

$$\mathbb{E}\|G\|_{\mathcal{K}\cap rD} \le c_1\sqrt{N}r \quad\text{and}\quad \mathbb{E}\sup_{h\in(\mathcal{K}-\mathcal{K})\cap rD}\Big|\frac{1}{\sqrt N}\sum_{i=1}^N\varepsilon_ih(X_i)\Big| \le c_2\sqrt{N}r.$$


Let $v \in \mathcal{K}$ and set

$$M_{v,j} = \frac{1}{\ell}\sum_{i\in I_j}|v(X_i)|;$$

thus $(M_{v,j})_{j=0}^{M-1}$ are $M$ independent copies of the random variable
$M_v = \frac{1}{\ell}\sum_{i=1}^\ell|v(X_i)|$, which, by Lemma 3.4 satisfies the small-ball condition
with constants $\kappa_0$ and $\varepsilon = 3/4$.

One may verify that

$$\tfrac{3}{4}\kappa_0\|v\|_{L_2} \le \|M_v\|_{L_2} \le \|v\|_{L_2}.$$

By Lemma 3.6, with probability at least $1 - 2\exp(-c_1\delta^2\varepsilon M) = 1 - 2\exp(-c_2N)$,
there is $J \subset \{0,\dots,M-1\}$, $|J| \ge (1-\delta/2)\varepsilon M$, and for every $j \in J$,

$$\tfrac{3}{4}\kappa_0^2\|v\|_{L_2} \le M_{v,j} \le c_3\|v\|_{L_2}/\sqrt{\delta\varepsilon} = c_4\|v\|_{L_2}.$$


Hence, the same assertion holds uniformly for $\exp(c_2N/2)$ random variables
of the form $M_v$. And, in particular, for every $v \in V_r \subset \mathcal{K}\cap rS(L_2) = \mathcal{K}_r$,
which is a maximal $\eta$-separated set for a choice of $\eta$ large enough to ensure
that $|V_r| \le \exp(c_2N/2)$.

Therefore, with probability at least $1 - 2\exp(-c_2N/2)$, for every $v \in V_r$
there is a subset $J_v \subset \{0,\dots,M-1\}$ of cardinality at least $(1-\delta/2)\varepsilon M$ and
for every $j \in J_v$,

$$\tfrac{3}{4}\kappa_0^2r = \tfrac{3}{4}\kappa_0^2\|v\|_{L_2} \le M_{v,j} \le c_4r. \tag{3.5}$$

By Sudakov’s inequality applied to the set $\mathcal{K}_r$ and using the choice of $r$, one
may select

$$\eta = c_5\frac{\mathbb{E}\|G\|_{\mathcal{K}_r}}{\sqrt{c_2N/2}} \le c_6c_1r.$$


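Next, consider the empirical oscillation term: for every $f \in \mathcal{K}_r$, let $\pi f$ be
the best approximation with respect to the $L_2$ distance of $f$ in $V_r$. Set
$u_f = \mathbb{1}_{\{|f-\pi f|>3\kappa_0^2r/8\}}$, consider the class of indicator functions $\mathcal{U}_r = \{u_f : f \in \mathcal{K}_r\}$
and let

$$\psi(X_1,\dots,X_N) = \sup_{u_f\in\mathcal{U}_r}\frac{1}{N}\sum_{i=1}^Nu_f(X_i).$$

By the bounded differences inequality (see, for example, m), with probability
at least $1 - \exp(-c_7t^2)$,

$$\psi(X_1,\dots,X_N) \le \mathbb{E}\psi + \frac{t}{\sqrt{N}}.$$

To estimate $\mathbb{E}\psi$ from above, set $\phi(t) = t/(3\kappa_0^2r/8)$. Observe that for every
$u_f \in \mathcal{U}_r$, $u_f(X) \le \phi(|f-\pi f|(X))$, and that by the Giné-Zinn symmetrization
theorem mm and the choice of $r$,

$$\mathbb{E}\sup_{u_f\in\mathcal{U}_r}\frac{1}{N}\sum_{i=1}^Nu_f(X_i) \le \mathbb{E}\sup_{f\in\mathcal{K}_r}\frac{1}{N}\sum_{i=1}^N\phi(|f-\pi f|(X_i))$$

$$\le \mathbb{E}\sup_{f\in\mathcal{K}_r}\Big|\frac{1}{N}\sum_{i=1}^N\phi(|f-\pi f|(X_i)) - \mathbb{E}\phi(|f-\pi f|(X))\Big| + \sup_{f\in\mathcal{K}_r}\mathbb{E}\phi(|f-\pi f|)$$

$$\le \frac{c}{\kappa_0^2r}\Big(\mathbb{E}\sup_{f\in\mathcal{K}_r}\Big|\frac{1}{N}\sum_{i=1}^N\varepsilon_i(f-\pi f)(X_i)\Big| + \sup_{f\in\mathcal{K}_r}\|f-\pi f\|_{L_2}\Big) \le \frac{c}{\kappa_0^2r}\cdot(c_2r + \eta) \le \frac{\delta\varepsilon}{4\ell},$$

when $c_1 \sim \kappa_0^2/\ell$ and $c_2 \sim \kappa_0^2/\ell$, and thus depend only on $q$ and $L$.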

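Setting $t = \delta\varepsilon\sqrt{N}/4\ell$, it follows that with probability at least
$1 - 2\exp(-c_8(q,L)N)$, for every $f \in \mathcal{K}_r = \mathrm{star}(\mathcal{F})\cap rS(L_2)$,

$$|\{i : |f-\pi f|(X_i) \ge (3\kappa_0^2/8)r\}| \le \delta\varepsilon N/2\ell = \delta\varepsilon M/2.$$

Therefore, at most $\delta\varepsilon M/2$ of the $M$ ‘bins’ $I_j$ contain a sample point $X_i$ for
which $|f-\pi f|(X_i) \ge (3\kappa_0^2/8)r$; on the remaining $(1-\delta\varepsilon/2)M$ bins,

$$\frac{1}{\ell}\sum_{i\in I_j}|f-\pi f|(X_i) \le (3\kappa_0^2/8)r.$$

Hence, with probability at least $1 - 2\exp(-c_9(q,L)N)$, for every $f \in \mathrm{star}(\mathcal{F})\cap rS(L_2)$
there is a subset of $\{0,\dots,M-1\}$ of cardinality at least $(1-\delta)\varepsilon M = 0.6M$, on which

$$|M_{f,j}| \ge |M_{\pi f,j}| - |M_{f-\pi f,j}| \ge (3\kappa_0^2/4)r - (3\kappa_0^2/8)r = (3\kappa_0^2/8)\|f\|_{L_2}, \quad\text{and}$$

$$|M_{f,j}| \le |M_{\pi f,j}| + |M_{f-\pi f,j}| \le (c_4 + 3\kappa_0^2/8)\|f\|_{L_2}. \tag{3.6}$$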
Moreover, since the estimates are positive homogeneous and $\mathrm{star}(\mathcal{F})$ is star-shaped
around 0, (3.6) is true on the same event when $f \in \mathrm{star}(\mathcal{F})$ and
$\|f\|_{L_2} \ge r$. The claim follows by recalling that $\ell$ and $\kappa_0$ depend only on $q$
and $L$, and selecting $0 < \alpha < 3\kappa_0^2/8$ and $\beta > c_4 + 3\kappa_0^2/8$.

The proof of the second part is almost identical: $V_r$ is defined exactly as
above, and for every $f \in \mathrm{star}(\mathcal{F})\cap rD$, $\pi f$ is the best approximation in $V_r$;
thus, $\|f-\pi f\|_{L_2} \le 2r$. Just as in the proof of the first part, with probability
at least $1 - 2\exp(-c_9(q,L)N)$, for every $f \in \mathrm{star}(\mathcal{F})\cap rD$,

$$|\{i : |f-\pi f|(X_i) \ge 3r\}| \le \delta\varepsilon M/2.$$

Thus, on at least $(1-\delta)\varepsilon M = 0.6M$ of the ‘bins’

$$|M_{f,j}| \le c_{10}(q,L)r,$$

and one may choose $\beta = \max\{c_4 + 3\kappa_0^2/8, c_{10}\}$.


4 Proof of Theorem 1.10

Observe that the second and the third conditions in the definition of $\mathcal{A}_{u_0}$ are
independent of $u_0$, and we shall begin by verifying those.

Given the base class $\mathcal{F}$, recall that $U = \{(f_1+f_2)/2 : f_1, f_2 \in \mathcal{F}\}$ and
that $H = \mathrm{star}(U-U)$. Thus, for every $h \in H$, $\|h\|_{L_q} \le L\|h\|_{L_2}$. Let $r_0$ be
the infimum of the set of all $r > 0$ for which

$$\mathbb{E}\|G\|_{H\cap rD} \le c_1(q,L)\sqrt{N}r \quad\text{and}\quad \mathbb{E}\sup_{h\in(H-H)\cap rD}\Big|\frac{1}{\sqrt N}\sum_{i=1}^N\varepsilon_ih(X_i)\Big| \le c_2(q,L)\sqrt{N}r, \tag{4.1}$$

for constants $c_1$ and $c_2$ as in Theorem 3.5.

Since $H$ and $H-H$ are star-shaped around 0, (4.1) holds for every
$r \ge r_0$. Invoking Theorem 3.5 for $\mathcal{F} = U-U$ and $r = 2r_0$, there are
constants $\alpha < 1 < \beta$ and $\ell$ that depend only on $q$ and $L$ for which, with
probability at least $1 - 2\exp(-c_0(q,L)N)$, for every $h \in H$,

• if $\|h\|_{L_2} \ge 2r_0$ then $\alpha\|h\|_{L_2} \le \mathrm{Med}_\ell(|h(X_i)|)_{i=1}^N \le \beta\|h\|_{L_2}$.

• if $\|h\|_{L_2} \le 2r_0$ then $\mathrm{Med}_\ell(|h(X_i)|)_{i=1}^N \le \beta\cdot2r_0$.

In particular, for any $r_U \ge 2r_0$, the third condition in the definition of $\mathcal{A}_{u_0}$
is verified.

Next, let $\alpha$ and $\beta$ be as above and set $\rho = (\alpha/20\beta)^2$. Consider Theorem
3.1 for $\mathcal{F} = U-U$ (and in which case, $\mathcal{K} = \mathrm{star}(U-U) = H$). Recall that
$\gamma_1 = q/2(q-1)$ and $\gamma_2 = (q-2)/2(q-1)$ and set $\zeta_3$ by $\rho \sim \zeta_3^{\gamma_2}$, and in
particular, $\zeta_3$ depends only on $q$ and $L$. Set $r_1$ for which

$$\mathbb{E}\|G\|_{(H-H)\cap r_1D} \le \zeta_3\sqrt{N}r_1, \quad\text{and}\quad \mathbb{E}\sup_{h\in(H-H)\cap r_1D}\Big|\frac{1}{\sqrt N}\sum_{i=1}^N\varepsilon_ih(X_i)\Big| \le \zeta_3\sqrt{N}r_1.$$

It follows that with probability at least

$$1 - 2N\exp(-c_1\zeta_3^{\gamma_1}N) \ge 1 - 2\exp(-c_2(q,L)\rho^{\gamma_1/\gamma_2}N),$$

if $h \in H-H$ and satisfies $\|h\|_{L_2} \ge r_1$ then

$$\frac{1}{N}\sum_{i=1}^Nh^2(X_i) \ge (1-\rho)\|h\|_{L_2}^2.$$

Moreover, since $0 \in H$, the same is true for every difference $h = u_1-u_2 \in U-U \subset H-H$
provided that $\|u_1-u_2\|_{L_2} \ge r_1$. Thus, the second part in
the definition of $\mathcal{A}_{u_0}$ holds for $r_U \ge r_1$.

Turning to the first part of the definition of $\mathcal{A}_{u_0}$ (which does depend on
$u_0$), one may apply Lemma 1.7 to the set $U$ and for

$$r_2 \ge r_M(\mathcal{F},\rho/4,\delta/2,u_0).$$

Setting $\xi = u_0(X) - Y$ and $\xi_i = u_0(X_i) - Y_i$, it follows that with probability
at least $1-\delta$, for every $u \in U$,

$$\Big|\frac{1}{N}\sum_{i=1}^N\xi_i(u-u_0)(X_i) - \mathbb{E}\xi(u-u_0)\Big| \le \rho\max\{\|u-u_0\|_{L_2}^2, r_2^2\},$$

as required.

Finally, one may combine all the above conditions, by noting that for
$c_3 = \rho/4$, $c_4 = \min\{c_2,\zeta_3\}$ and $c_5 = \min\{c_1,\zeta_3\}$, the choice of
$r_2 = r_{opt}(\mathcal{F},\delta,c_3,c_4,c_5)$ is a valid choice in all of the above. Hence, for every
$u_0 \in U$,

$$Pr(\mathcal{A}_{u_0}) \ge 1 - \delta - 2\exp(-c_6(q,L)N),$$

and Theorem 1.10 follows from Theorem 2.5. ■
4.1 Proof of Corollary 1.12

Let $\mathcal{F}$ be a finite dictionary. While a learning procedure can only guarantee
an error rate of the order of $\sqrt{N^{-1}\log M}$, one may show that the aggregation
procedure suggested above leads to a much better estimate.

Let us begin by reformulating Corollary 1.12.

Theorem 4.1 For every $L \ge 1$ and $q > 2$ there exists a constant $c_1$ that
depends only on $L$ and $q$ for which the following holds. Let $\mathcal{F} = \{f_1,\dots,f_M\}$
and assume that for $w \in \mathrm{span}(\mathcal{F})$ and every $p \ge 2$, $\|w\|_{L_p} \le L\sqrt{p}\|w\|_{L_2}$.
Assume further that $Y \in L_q$ for some $q > 2$. Then for every $0 < \delta < 1$, with
probability at least $1-\delta$,

$$\mathbb{E}\big((\tilde f(X)-Y)^2\,|\,\mathcal{D}\big) \le \mathbb{E}(f^*(X)-Y)^2 + c_1\delta^{-2/q}\log(2/\delta)\|f^*-Y\|_{L_q}^2\frac{\log M}{N}.$$

As all the assumptions of Theorem 1.10 are satisfied here, what is left is
to identify $r_{opt}$. To that end, note that $|U-U| \le M^4$. Thus, for every $r > 0$
there is $k_r \le M^8$ and functions $w_{i,r}$, that satisfy $\|w_{i,r}\|_{L_2} \le r$ and

$$\mathrm{star}(H-H)\cap rD \subset \{\lambda w_{i,r} : 1 \le i \le k_r,\ 0 \le \lambda \le 1\} = W_r.$$

By the moment equivalence in $\mathrm{span}(\mathcal{F})$, a straightforward chaining argument
and the Majorizing Measures Theorem (see, e.g., m for similar arguments)
it follows that

$$\mathbb{E}\sup_{w\in W_r}\Big|\frac{1}{\sqrt N}\sum_{i=1}^N\varepsilon_iw(X_i)\Big| \le c_1L\,\mathbb{E}\|G\|_{W_r}.$$

And, it is standard to verify that

$$\mathbb{E}\|G\|_{W_r} = \mathbb{E}\sup_{1\le i\le k_r}G_{w_i} \le c_2r\sqrt{\log M}.$$

Therefore, if $N \ge c_3(L,\zeta)\log M$, then

$$r_{Q,1}(\mathcal{F},\zeta) = r_{Q,2}(\mathcal{F},\zeta) = 0.$$


Turning our attention to $r_M$, one may invoke the following fact from [20]:

Theorem 4.2 Let $\xi \in L_q$ for some $q > 2$ and assume that for every $f,h \in \mathcal{F}\cup\{0\}$
and every $p \ge 2$, $\|f-h\|_{L_p} \le L\sqrt{p}\|f-h\|_{L_2}$. Then, for every
$u,w > 1$, with probability at least

$$1 - c_0(q)w^{-q}\frac{\log^qN}{N^{q/2-1}} - 2\exp\Big(-c_1(L)u^2\Big(\frac{\mathbb{E}\|G\|_{\mathcal{F}}}{\mathrm{diam}(\mathcal{F},L_2)}\Big)^2\Big), \tag{4.2}$$

one has

$$\sup_{f\in\mathcal{F}}\Big|\frac{1}{\sqrt N}\sum_{i=1}^N\varepsilon_i\xi_if(X_i)\Big| \le c_2(q)Lwu\|\xi\|_{L_q}\mathbb{E}\|G\|_{\mathcal{F}}.$$

For every $u_0 \in U$ let $\mathcal{F} = \mathrm{star}(U-u_0)\cap rD$. Since $|U-u_0| = |U| \le M^2$
then by Theorem 4.2 and with probability as in (4.2),

$$\sup_{f\in\mathcal{F}}\Big|\frac{1}{\sqrt N}\sum_{i=1}^N\varepsilon_i\xi_if(X_i)\Big| \le c(q)Lwu\|\xi\|_{L_q}r\sqrt{\log M}.$$


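Therefore, if

$$r \ge c(q)\frac{L}{\zeta}wu\|\xi\|_{L_q}\sqrt{\frac{\log M}{N}}$$

then

$$\sup_{f\in\mathcal{F}}\Big|\frac{1}{\sqrt N}\sum_{i=1}^N\varepsilon_i\xi_if(X_i)\Big| \le \zeta\sqrt{N}r^2.$$

Clearly, for any nontrivial class $\mathcal{F}$, $\mathbb{E}\|G\|_{\mathcal{F}} \gtrsim \mathrm{diam}(\mathcal{F},L_2)$; thus, setting
$w \sim (1/\delta)^{1/q}$ and $u \sim \sqrt{\log(2/\delta)}$, with probability at least $1-\delta$,

$$r_M \le c_1(q)\frac{L}{\zeta}\cdot\delta^{-1/q}\log^{1/2}(2/\delta)\|\xi\|_{L_q}\sqrt{\frac{\log M}{N}},$$

which completes the proof of Theorem 4.1.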

4.2 A remark on the bounded case

Let us briefly mention a way in which one may obtain a version of Theorem
1.10 when both the dictionary and the target are assumed to be bounded in
$L_\infty$, but $\mathcal{F}$ may be infinite.

As noted in m, an $L_\infty$ type of assumption is of a very different nature
than an assumption on norm equivalence: the former does not lead to a
useful small-ball estimate on class members, and in particular, the proofs
presented in Section 3 do not hold in that case.

Fortunately, there are highly potent tools at one’s disposal when bounded 
classes are concerned, namely, Talagrand’s concentration inequality for bounded 
empirical processes and the contraction principle for empirical and Bernoulli 
processes indexed by bounded classes (see, e.g., naisasi)- Using that well 
established machinery, one may show that $\mathcal{A}_{u_0}$ is a high probability event.
In fact, thanks to the two-sided concentration estimates, the argument is
much simpler.

For example, assuming that the functions involved are bounded by 1
almost surely and applying a contraction argument, it follows that with
high probability and in expectation,

$$\sup_{u\in U}\Big|\frac{1}{N}\sum_{i=1}^N(u_0(X_i)-Y_i)(u-u_0)(X_i) - \mathbb{E}(u_0(X)-Y)(u-u_0)(X)\Big|$$



and 



where ( Si)f =1 are independent, symmetric {—1, l}-valued random variables 
that are independent of (Xj, 

Moreover, the multiplier and quadratic processes concentrate well around
their mean, leading to a natural complexity parameter that is rather similar
to $r_{opt}$, and to an exponential probability estimate.

The obvious downside in this concentration-contraction based argument 
is that it totally eliminates the dependence on the distance between F and 
Y (see the discussion in na \M for more details). As an outcome, the 
estimate in the bounded case does not improve when the problem becomes 
more ‘realizable’. 